Semantic segmentation, the process of assigning every pixel of an image to a class based on its properties, is a fundamental topic in computer vision. In recent years, deep networks, especially convolutional neural networks (CNNs), have elevated the performance of semantic segmentation algorithms to new heights [1, 2, 3, 4, 5]. It has been shown that encoding rich contextual information helps achieve good results in semantic segmentation [5, 6, 4, 7]. However, due to the local connectivity of CNN layers, the receptive field is often too small to exploit sufficient contextual features. Even in recent deep models, the effective receptive field is still not large enough to cover the entire image [8].
To address this problem, many methods have been proposed to aggregate global contextual information from local features. Some [6, 5, 9, 10] utilize large filters, atrous spatial pyramid pooling (ASPP) or a pyramid pooling module (PPM) to enlarge receptive fields and extract multi-scale features at the end of the network. Others [11, 7, 12] adopt spatial attention mechanisms to capture long-range dependencies within the feature maps. The non-local block proposed in [11] has been widely applied to semantic segmentation as an instance of the spatial attention mechanism that aggregates global contextual information densely. For each query element in a non-local block, all key elements are used to compute the pairwise relations with the query element and generate a dense affinity matrix. The contextual information at all locations is then aggregated by a weighted sum, with the weights defined by the affinity matrix. Although the non-local block improves performance significantly, it incurs a high computational cost even on high-end GPU-based platforms [13, 14]. Moreover, some features may contain irrelevant information, so aggregating them is useless or even harmful to performance.
Based on these observations, we propose a sparse non-local (SNL) block that aggregates contextual information from a sparse set of features, reducing the computational cost without sacrificing performance. For each query element in the proposed SNL block, only a few key and value elements are sampled to compute a small sparse affinity matrix, and the sampling locations are learnable. To demonstrate the effectiveness of the proposed SNL block, we build a sparse spatial attention network (SSANet) based on the ResNet-FCN backbone with the SNL block integrated. The proposed network has been evaluated on several semantic segmentation benchmarks, including Cityscapes [15], PASCAL Context [16] and ADE20K [17], and achieves state-of-the-art performance.
2 Related Work
Semantic Segmentation. Long et al. proposed the fully convolutional network [1], in which the fully connected layers are replaced with convolutional layers to convert the semantic segmentation task into a pixel-level classification task. Chen et al. proposed a family of segmentation networks called DeepLab [6, 18, 3], which adopt atrous (dilated) convolutions to increase receptive fields while preserving the spatial resolution of feature maps. Zhao et al. [5] utilized a group of average pooling operations to encode multi-scale features. Another popular design is the encoder-decoder structure [19, 2], composed of an encoder and a symmetric decoder, where features of different levels are extracted in each stage of the encoder and the image resolution is recovered step by step in the decoder.
Contextual Information. Exploiting contextual information helps improve performance in semantic segmentation because objects appear at multiple scales and different locations exhibit long-range dependencies. In [6], an atrous spatial pyramid pooling (ASPP) module was proposed to encode multi-scale contextual features with several parallel atrous convolutions of different rates. In [5], a pyramid pooling module was adopted at the end of the model to exploit multi-level contextual information through pooling operations.
Spatial Attention. The spatial attention mechanism can be considered a contextual feature aggregation method, which models the long-range dependencies between each pair of elements in the feature maps and then gathers contextual information at each location. Wang et al. [11] proposed the non-local block and integrated it into deep networks as a self-attention module to capture long-range dependencies for video classification. In [7], spatial and channel attention modules were adopted to capture relations and re-weight each location and channel, respectively. Zhao et al. [12] proposed a two-branch attention block to collect and distribute information separately.
3.1 Non-local Block and Spatial Attention
The structure of the non-local block [11] is depicted in Fig. 1(a). Consider an input feature map $X \in \mathbb{R}^{C \times N}$, where $C$ and $N$ are the channel number and spatial size of the feature map, respectively. Three $1 \times 1$ convolutions, $W_q$, $W_k$ and $W_v$, are used to transform $X$ to the query, key and value embeddings respectively, i.e. $Q = W_q(X)$, $K = W_k(X)$ and $V = W_v(X)$. Then a dense affinity matrix, $A \in \mathbb{R}^{N \times N}$, is computed by a matrix multiplication and softmax normalization as

$$A_{i,j} = \frac{\exp(q_i^{\top} k_j)}{\sum_{j'=1}^{N} \exp(q_i^{\top} k_{j'})}, \tag{1}$$

where $A_{i,j}$ is the normalized similarity between $q_i$ and $k_j$, with $q_i \in Q$ and $k_j \in K$, and $i$ and $j$ are the spatial location indexes of the query and key respectively. After generating the affinity matrix, another matrix multiplication is used to aggregate the contextual information with the value contents as follows,

$$y_i = \sum_{j=1}^{N} A_{i,j}\, v_j. \tag{2}$$

The final output can be computed as

$$Z = W_o(Y) + X, \tag{3}$$

where $W_o$ refers to a $1 \times 1$ convolution. In this way, long-range dependencies are captured and used to provide rich contextual information.
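The dense aggregation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the $1 \times 1$ convolutions are replaced by plain matrices acting on flattened features, the embedding width `c_e` is a hypothetical parameter, and the residual connection follows the original non-local design [11].

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def non_local(x, w_q, w_k, w_v, w_o):
    """Dense non-local aggregation on flattened features.

    x:             (C, N) input features, N spatial locations.
    w_q, w_k, w_v: (c_e, C) matrices standing in for 1x1 convolutions.
    w_o:           (C, c_e) matrix standing in for the output 1x1 convolution.
    """
    q, k, v = w_q @ x, w_k @ x, w_v @ x      # each (c_e, N)
    affinity = softmax(q.T @ k, axis=1)      # (N, N), every row sums to 1
    y = v @ affinity.T                       # (c_e, N): y_i = sum_j A_ij * v_j
    return w_o @ y + x, affinity             # residual connection (assumed)

rng = np.random.default_rng(0)
C, c_e, N = 8, 4, 12                         # illustrative sizes only
x = rng.standard_normal((C, N))
z, affinity = non_local(
    x,
    rng.standard_normal((c_e, C)),
    rng.standard_normal((c_e, C)),
    rng.standard_normal((c_e, C)),
    rng.standard_normal((C, c_e)),
)
print(z.shape, affinity.shape)
```

The affinity matrix has one row per query and one column per key, so both its memory footprint and the cost of the two matrix products grow quadratically with the number of spatial locations.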
3.2 Sparse Non-local Block
Although the non-local block has proven effective for semantic segmentation [7, 20], it is very time consuming. The review of the non-local block in Sec. 3.1 indicates that the matrix multiplications in Eqn. 1 and Eqn. 2 dominate the computational cost, with a complexity of $O(CN^2)$ for $C$ channels and $N$ spatial locations. If the dense affinity matrix could be replaced with a small sparse affinity matrix, the computational complexity would therefore be reduced.
In the proposed sparse non-local (SNL) block, we sample a subset of key elements to multiply with each query element and generate a sparse affinity matrix denoted as $\hat{A} \in \mathbb{R}^{N \times S}$, where $S$ is the number of sampled key elements. The sampled key elements should be representative, the sampling region should cover a large area to encode contextual information, and the number of sampled elements should be small to reduce the computational complexity.

Motivated by the deformable convolution [21], we restrict the initial sampling locations to the neighbouring locations of the corresponding query pixel, and then shift the locations with offsets $\Delta P \in \mathbb{R}^{2S \times N}$, which are learned by a convolution performed on $X$; the channel dimension $2S$ corresponds to $S$ 2D offsets. Hence, the final sampling coordinates of each key element are calculated as

$$p_{i,s} = p^{0}_{i,s} + \Delta p_{i,s}, \tag{4}$$

where $p^{0}_{i,s}$ and $\Delta p_{i,s}$ denote the initial coordinates and the corresponding offsets in $\Delta P$, respectively. As the offsets are typically fractional, bilinear interpolation is used to sample key values and make the offset computation differentiable. The sampled key element $\hat{k}_{i,s}$ is calculated as

$$\hat{k}_{i,s} = G(K, p_{i,s}), \tag{5}$$

where $G$ is the bilinear interpolation function. For $q_i$ and $\hat{k}_{i,s}$, the pairwise similarity can be computed as

$$\hat{A}_{i,s} = \frac{\exp(q_i^{\top} \hat{k}_{i,s})}{\sum_{s'=1}^{S} \exp(q_i^{\top} \hat{k}_{i,s'})}. \tag{6}$$

Afterwards, the value contents $\hat{v}_{i,s}$ are sampled at the same locations as $\hat{k}_{i,s}$, and the contextual information is aggregated as

$$y_i = \sum_{s=1}^{S} \hat{A}_{i,s}\, \hat{v}_{i,s}. \tag{7}$$

Because the dense matrix multiplications in the non-local block are replaced with the sparse matrix multiplications defined by Eqn. 6 and Eqn. 7, the computational complexity is reduced to $O(CNS)$, which is significantly lower than $O(CN^2)$ when $S \ll N$ ($N$ is 2401 for an input feature map of spatial size $49 \times 49$ when evaluating on the Cityscapes dataset, and $S$ is 81 in our model).
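To make the saving concrete, the short calculation below counts multiply-accumulate operations for the affinity and aggregation steps with 2401 spatial locations and 81 sampled keys per query, the two figures quoted above. The channel width of 256 is only an illustrative assumption, and the count ignores the offset convolution and the bilinear sampling, so it should be read as a rough estimate.

```python
def attention_macs(channels, num_queries, keys_per_query):
    # One multiply-accumulate per channel for every (query, key) pair,
    # counted twice: once for the affinity and once for the aggregation.
    return 2 * channels * num_queries * keys_per_query

C = 256    # assumed embedding width, for illustration only
N = 2401   # spatial locations
S = 81     # sampled key elements per query

dense = attention_macs(C, N, N)
sparse = attention_macs(C, N, S)
ratio = dense / sparse   # equals N / S, about 29.6, independent of C
print(f"dense: {dense:,} MACs, sparse: {sparse:,} MACs, ratio: {ratio:.1f}")
```

Under these assumptions the channel width cancels out, and the ratio depends only on how many keys each query attends to.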
The structure of the SNL block is depicted in Fig. 1(b). First, four convolutions are performed on the input features to generate the query, key, value and offsets, respectively. Next, key elements are sampled for each query element based on the offsets, multiplied by the corresponding query element, and normalized to generate the sparse affinity matrix. Then, the generated matrix is multiplied with the sampled value features to obtain the aggregated features $Y$. Finally, a convolution is applied for feature fusion.
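The sampling and aggregation steps of the SNL block can be sketched in NumPy as follows. This is a simplified illustration under several assumptions that are not stated in the text: the initial sampling locations form a square `grid x grid` neighbourhood centred on the query, the offsets are stored as interleaved (dy, dx) channel pairs, samples falling outside the feature map read zeros, and the query, key, value and offset tensors are taken as given instead of being produced by convolutions.

```python
import numpy as np

def bilinear_sample(feat, ys, xs):
    """Sample feat (C, H, W) at fractional coordinates ys, xs (same shape)."""
    C, H, W = feat.shape
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    out = np.zeros((C,) + ys.shape)
    for dy in (0, 1):
        for dx in (0, 1):
            yi, xi = y0 + dy, x0 + dx
            # Bilinear weight of this corner; zero outside the feature map.
            wgt = (1 - np.abs(ys - yi)) * (1 - np.abs(xs - xi))
            valid = (yi >= 0) & (yi < H) & (xi >= 0) & (xi < W)
            vals = feat[:, np.clip(yi, 0, H - 1), np.clip(xi, 0, W - 1)]
            out += vals * (wgt * valid)
    return out

def sparse_non_local(q, k, v, offsets, grid=3):
    """q, k, v: (C, H, W); offsets: (2*S, H, W) with S = grid * grid (grid odd)."""
    C, H, W = q.shape
    S, r = grid * grid, grid // 2
    # Initial sampling locations: a grid x grid neighbourhood of each query.
    dy, dx = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1), indexing="ij")
    base_y = np.arange(H)[:, None, None] + dy.reshape(1, 1, S)
    base_x = np.arange(W)[None, :, None] + dx.reshape(1, 1, S)
    off = offsets.reshape(S, 2, H, W)                 # assumed (dy, dx) layout
    ys = base_y + off[:, 0].transpose(1, 2, 0)        # (H, W, S)
    xs = base_x + off[:, 1].transpose(1, 2, 0)
    k_s = bilinear_sample(k, ys, xs)                  # (C, H, W, S)
    v_s = bilinear_sample(v, ys, xs)                  # same locations as the keys
    logits = np.einsum("chw,chws->hws", q, k_s)
    logits -= logits.max(axis=-1, keepdims=True)      # stable softmax over S
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)          # sparse affinity (H, W, S)
    y = np.einsum("hws,chws->chw", attn, v_s)
    return y, attn

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 5, 6)) for _ in range(3))
offsets = 0.5 * rng.standard_normal((2 * 9, 5, 6))
y, attn = sparse_non_local(q, k, v, offsets, grid=3)
print(y.shape, attn.shape)
```

Each query attends to `S` interpolated locations only, so the affinity tensor stores `S` weights per location instead of one weight for every other location in the feature map.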
3.3 Network Architecture
The entire network architecture is shown in Fig. 2. We use MobileNetV2 [22] and ResNet-101 [23] as our backbones. Following previous studies, we remove the last downsampling operation in the backbones and utilize dilated convolutions in the last stage to maintain the size of the receptive field. We reduce the number of channels to 512 in the ResNet-101-based network and 256 in the MobileNetV2-based network with a convolution, and then apply the SNL block to the reduced features. In addition, a global average pooling branch is introduced and concatenated with the reduced features to provide image-level information and improve performance. Finally, we employ a simple decoder. The output features of the encoder are first upsampled and then concatenated with the reduced low-level features from stage 2 of the backbone. After the concatenation, two convolutions are used to fuse the features before the final classifier.
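The global average pooling branch can be illustrated with a small NumPy sketch. It assumes that the pooled vector is simply broadcast back to every spatial location before concatenation; any convolution or normalization applied inside the branch is omitted because the text does not describe it.

```python
import numpy as np

def concat_global_context(feat):
    """feat: (C, H, W). Returns (2C, H, W) with an image-level vector appended."""
    C, H, W = feat.shape
    pooled = feat.mean(axis=(1, 2), keepdims=True)   # (C, 1, 1) global average
    pooled = np.broadcast_to(pooled, (C, H, W))      # repeat at every location
    return np.concatenate([feat, pooled], axis=0)

feat = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
fused = concat_global_context(feat)
print(fused.shape)
```

Every location then carries both its own local features and the same image-level summary, which the following fusion layers can combine.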
4.1 Implementations and Datasets
We conducted all the experiments in PyTorch [24] with CUDA 10.0. We employed ImageNet-pretrained [25] MobileNetV2 and ResNet-101 as the backbones, and the stochastic gradient descent (SGD) algorithm with momentum 0.9 and weight decay 0.0001 was used to train the networks on all the datasets. The “poly” learning rate policy, in which the learning rate is multiplied by $(1 - \frac{iter}{max\_iter})^{power}$, was used, and the initial learning rate was set to 0.005. We employed random horizontal flipping, random scaling and random cropping for data augmentation. The mean Intersection-over-Union (IoU) was used as the evaluation metric.
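The learning rate schedule and the evaluation metric can both be written in a few lines of plain Python. The sketch below is illustrative only: the default `power=0.9` is a commonly used value and an assumption here, and `ignore_index=255` is a hypothetical convention for unlabeled pixels.

```python
def poly_lr(base_lr, iteration, max_iterations, power=0.9):
    # "Poly" policy: decay the base rate by (1 - iter / max_iter) ** power.
    # power=0.9 is an assumed default, not a value taken from the text.
    return base_lr * (1 - iteration / max_iterations) ** power

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over the classes that occur in pred or target.

    pred, target: flat sequences of integer class ids of equal length.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:          # skip unlabeled pixels (assumed convention)
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:                          # the pixel counts towards both unions
            union[t] += 1
            union[p] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")

for it in (0, 5000, 10000):
    print(it, poly_lr(0.005, it, 10000))
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

In the small example, class 0 has one correct pixel out of a union of two and class 1 has two correct pixels out of a union of three, so the metric is the average of 1/2 and 2/3.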
Cityscapes. The dataset contains 5000 finely annotated images, divided into 2975, 500 and 1525 images for training, validation and testing, respectively. All images are of size 1024 × 2048, and 19 semantic classes are used for segmentation.
PASCAL Context. The dataset provides dense semantic annotations for the PASCAL VOC 2010 images and contains 4998 training and 5105 validation images. Following previous work, we used the 59-class annotations to train the networks and measure performance.
ADE20K. The dataset is a challenging segmentation dataset which contains 150 semantic classes. The training and validation sets consist of 20K and 2K images respectively.
4.2 Ablation Studies
To evaluate the performance of the proposed method, we first conducted ablation studies on the Cityscapes validation set with single scale testing.
Ablation on sampling number. We adopted a simple MobileNetV2-based FCN as the baseline, which achieved a mean IoU of 70.39%. We then integrated the sparse non-local block at the end of the backbone and evaluated its effect by varying the number of sampled key and value elements for each query element. The results in Table 1 show that the sparse non-local block improved on the baseline in every configuration. Specifically, the improvement over the baseline increased as more key elements were sampled, indicating that encoding more contextual information is beneficial for the segmentation task. When the sampling number was larger than 81, the improvement began to drop, possibly because some key elements contain redundant or unhelpful information. Hence we selected 81 as the sampling number in the final architecture.
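Table 1: Ablation results on the Cityscapes validation set (NoS: number of sampled elements per query; GP: global average pooling branch).
| Method | NoS | Mean IoU (%) |
| Baseline + SNL | 9 | 73.54 |
| Baseline + SNL | 49 | 74.26 |
| Baseline + SNL | 81 | 74.51 |
| Baseline + SNL | 99 | 73.98 |
| Baseline + SNL | 121 | 73.96 |
| Baseline + SNL + GP | 81 | 74.94 |
| Baseline + SNL + GP + Decoder | 81 | 75.85 |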
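Table 2: Comparison with other context aggregation modules on the MobileNetV2 backbone.
| Method | Inf. time (ms) | Mean IoU (%) |
| + PPM [5] | 41 | 72.88 |
| + ASPP [6] | 55 | 74.48 |
| + NL [11] | 42 | 73.75 |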
Ablation on global pooling. As shown in Table 1, adopting the global average pooling path improved the mean IoU from 74.51% to 74.94%, which demonstrates the benefit of this branch.
Ablation on decoder. A simple decoder was employed at the end of the network to replace the naive bilinear upsampling operation. As shown in Table 1, accuracy improved by around 0.9% (from 74.94% to 75.85%). This improvement mainly came from the low-level features, which provided spatial information to help refine boundaries.
Ablation on SNL block. To compare the performance of our approach with other context aggregation approaches, we replaced the SNL block in the MobileNetV2-based network with the PPM [5], the ASPP module [6] and the standard NL block [11], respectively, and then evaluated these models and measured their inference time on a Titan V100 GPU. The results are reported in Table 2. Our approach outperformed all the other methods with faster inference speed. This indicates that spatial attention is a more effective way to encode multi-scale contextual features, and that using a subset of the key elements rather than all of them can filter out irrelevant contextual information.
4.3 Comparisons with State-of-the-Art Methods
Based on the ablation studies, we designed the sparse spatial attention network with the ResNet-101 backbone and the SNL block, and evaluated it on the Cityscapes, PASCAL Context and ADE20K datasets using a multi-scale testing strategy.
The comparisons with state-of-the-art methods on the Cityscapes test set are shown in Table 3. SSANet outperformed these previous methods with a mean IoU of 81.8%. For training on the PASCAL Context and ADE20K datasets, we changed the sampling number in the SNL block to 49 to fit the smaller cropping sizes used for these datasets. As shown in Table 4, SSANet achieved state-of-the-art performance on both datasets, showing the robustness of our approach.
5 Conclusion
In this paper, we present the sparse spatial attention network (SSANet) for semantic segmentation. A sparse non-local (SNL) block is proposed and integrated into the network. It utilizes a spatial attention mechanism to aggregate multi-scale contextual information and capture long-range dependencies adaptively. Unlike the standard non-local block, the proposed SNL block replaces the dense affinity matrix with a sparse affinity matrix to improve both efficiency and accuracy. The ablation experiments show that the SNL block significantly improves segmentation accuracy and achieves the best performance compared with other context aggregation approaches. SSANet achieves state-of-the-art results on the Cityscapes, PASCAL Context and ADE20K datasets, demonstrating the benefit and effectiveness of the proposed method.
References
- [1] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. CVPR, 2015, pp. 3431–3440.
- [2] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Proc. MICCAI, 2015, pp. 234–241.
- [3] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proc. ECCV, 2018, pp. 801–818.
- [4] W. Liu, A. Rabinovich, and A. C. Berg, “Parsenet: Looking wider to see better,” arXiv:1506.04579, 2015.
- [5] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proc. CVPR, 2017, pp. 2881–2890.
- [6] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” TPAMI, vol. 40, pp. 834–848, 2018.
- [7] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” in Proc. CVPR, 2019, pp. 3146–3154.
- [8] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Object detectors emerge in deep scene cnns,” arXiv:1412.6856, 2014.
- [9] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large kernel matters–improve semantic segmentation by global convolutional network,” in Proc. CVPR, 2017, pp. 4353–4361.
- [10] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang, “Denseaspp for semantic segmentation in street scenes,” in Proc. CVPR, 2018, pp. 3684–3692.
- [11] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proc. CVPR, 2018, pp. 7794–7803.
- [12] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. Change Loy, D. Lin, and J. Jia, “Psanet: Point-wise spatial attention network for scene parsing,” in Proc. ECCV, 2018, pp. 267–283.
- [13] Z. Zhu, M. Xu, S. Bai, T. Huang, and X. Bai, “Asymmetric non-local neural networks for semantic segmentation,” in Proc. ICCV, 2019, pp. 593–602.
- [14] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, “Ccnet: Criss-cross attention for semantic segmentation,” in Proc. ICCV, 2019, pp. 603–612.
- [15] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proc. CVPR, 2016, pp. 3213–3223.
- [16] R. Mottaghi, X. Chen, X. Liu, N. Cho, S. Lee, S. Fidler, R. Urtasun, and A. Yuille, “The role of context for object detection and semantic segmentation in the wild,” in Proc. CVPR, 2014, pp. 891–898.
- [17] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, “Semantic understanding of scenes through the ade20k dataset,” IJCV, vol. 127, pp. 302–321, 2019.
- [18] L. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv:1706.05587, 2017.
- [19] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” arXiv:1511.00561, 2015.
- [20] X. Zhu, D. Cheng, Z. Zhang, S. Lin, and J. Dai, “An empirical study of spatial attention mechanisms in deep networks,” in Proc. ICCV, 2019, pp. 6688–6697.
- [21] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional networks,” in Proc. ICCV, 2017, pp. 764–773.
- [22] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proc. CVPR, 2018, pp. 4510–4520.
- [23] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016, pp. 770–778.
- [24] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017.
- [25] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proc. CVPR, 2009, pp. 248–255.
- [26] Y. Yuan, X. Chen, and J. Wang, “Object-contextual representations for semantic segmentation,” arXiv:1909.11065, 2019.
- [27] T. Ke, J. Hwang, Z. Liu, and S. X. Yu, “Adaptive affinity fields for semantic segmentation,” in Proc. ECCV, 2018, pp. 587–602.
- [28] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal, “Context encoding for semantic segmentation,” in Proc. CVPR, 2018, pp. 7151–7160.