
Boundary Corrected Multi-scale Fusion Network for Real-time Semantic Segmentation

Image semantic segmentation aims at pixel-level classification of images and must satisfy requirements for both accuracy and speed in practical applications. Existing semantic segmentation methods mainly rely on high-resolution input to achieve high accuracy and do not meet inference-time requirements. Although some methods focus on high-speed scene parsing with lightweight architectures, they cannot fully mine semantic features under a low computational budget and yield relatively low performance. To realize real-time, high-precision segmentation, we propose a new method named Boundary Corrected Multi-scale Fusion Network, which uses the designed Low-resolution Multi-scale Fusion Module to extract semantic information. Moreover, to deal with boundary errors caused by low-resolution feature map fusion, we further design an additional Boundary Corrected Loss to constrain overly smooth features. Extensive experiments show that our method achieves a state-of-the-art balance of accuracy and speed for real-time semantic segmentation.



1 Introduction

As a basic vision task, image semantic segmentation [11] is crucial for scene understanding. Its goal is to assign a semantic category label to each image pixel. With the development of deep learning and improved computing resources, Convolutional Neural Networks (CNNs) have been applied to image segmentation and significantly outperform traditional methods based on hand-crafted features. The end-to-end fully convolutional network (FCN) [17] greatly promoted the development of CNNs in semantic segmentation. Since then, various feature extraction and fusion modules have been proposed to improve model accuracy [1, 2, 3, 4]. However, most existing methods are designed to achieve high per-pixel classification accuracy with high-resolution images or feature maps and cannot meet the speed requirements of deployment. Therefore, some researchers have recently focused on the design of real-time, efficient models [15, 22, 21], which have more potential value in practical applications.

Figure 1: Segmentation results of our proposed BCMFNet on the test sets of the Cityscapes and CamVid datasets.

Recently, several networks have emerged that build on efficient backbone networks, such as [14, 9, 13], which are based on ResNet-18 [7]. However, as better backbone networks are proposed, these structures designed for specific backbones are difficult to migrate. Other works develop new lightweight networks. BiSeNet [22] proposes a dual-branch network to solve the problem of the limited receptive field. ICNet [23] proposes cascaded networks that fuse the details of high- and low-resolution feature maps. However, their processing of high-resolution feature maps limits the speed of the network. CABiNet [10] proposes a dual-branch structure to extract spatial details and contextual information. DDRNet [8] proposes a dual-resolution network and a cascaded multi-scale feature extraction module. However, none of these dual-branch designs takes the fine boundary features of the image into account.

Based on the above observations, we propose a new method, the Boundary Corrected Multi-scale Fusion Network (BCMFNet), with multi-scale feature fusion and a boundary corrected loss. Considering the computational constraints of lightweight models, we propose a feature fusion method that performs multi-scale feature extraction on low-resolution feature maps. This avoids the high computational cost of processing high-resolution feature maps while capturing more contextual information. In addition, to address the loss of fine boundaries caused by the fusion of low-resolution feature maps, we use a boundary corrected loss to extract hard boundary samples, capture the long-distance information of feature maps, and improve the boundary perception of the model (Figure 1).

Figure 2: Overview of our proposed BCMFNet method. The LMFM is marked by the light yellow region, and the BCL by the green region. Black solid lines denote information paths with data processing, and black dashed lines denote information paths without data processing.

Our contributions are fourfold: (1) A new real-time semantic segmentation method is proposed, named Boundary Corrected Multi-scale Fusion Network (BCMFNet). (2) A novel Low-resolution Multi-scale Fusion Module (LMFM) is designed to fuse low-resolution feature maps with multi-scale features to obtain richer contextual information. (3) An additional Boundary Corrected Loss (BCL) function is introduced to enhance the learning of hard samples with correction of fine boundaries. (4) Our method achieves a good balance between accuracy and speed with 78.2% mIoU at 102 FPS on the Cityscapes dataset and 76.2% mIoU at 230 FPS on the CamVid dataset.

2 Method

In this section, we introduce our boundary corrected multi-scale fusion network (BCMFNet) in detail. In Section 2.1, we first describe the overall architecture of BCMFNet, including the network structure and the overall objective function during training. Then we introduce our proposed LMFM and BCL in the next two subsections.

2.1 Overall Structure

As shown in Figure 2, our method builds a novel network. The overall process of BCMFNet mainly includes four stages: feature extraction, low-resolution feature fusion, boundary hard sample correction, and upsampling.

At the feature extraction stage, we use DDRNet-s as the backbone network to fully fuse spatial and semantic information through multiple high- and low-resolution bilateral feature fusion. At the low-resolution feature fusion stage, we use LMFM to extract the information of low-resolution feature maps. At the boundary hard sample correction stage, we use BCL to correct the boundary loss problem caused by over-smoothing. In the training mode, the overall objective function of BCMFNet is as follows:


$$L = L_{ce}(p, y) + \lambda L_{bcl}(p, y, \hat{y})$$

where $L_{ce}$ is the cross-entropy loss, $L_{bcl}$ is the boundary loss, and $\lambda$ is a hyper-parameter that balances these two components. $p$ is the prediction result obtained by softmax from the feature map, and $y$ and $\hat{y}$ are the ground truth before and after flipping. Finally, we compress the channels of the low-resolution feature maps using 1×1 convolutions and upsample them using bilinear interpolation.

2.2 Low-resolution Multi-scale Fusion Module

Images contain objects of different sizes, and it is effective to process the feature maps at different scales, as proposed in the global convolutional network (GCN) [16] and Inception [19]. As shown in Figure 3, we compare different methods of extracting feature maps. The traditional bottleneck module [7] uses 3×3 convolution kernels in each layer of the network, which limits the receptive field because the convolution kernel is small and of fixed size. GCN adopts a multi-branch computational structure to improve segmentation accuracy, but the large amount of computation on high-resolution images limits the speed of the model. DDRNet-s [8] uses a cascaded method to fuse the information of each layer upward, but the smooth information of the lower layers has a strong influence on the feature maps of the higher layers, so fine boundary information is easily lost.

To address these issues, we propose a new module to extract contextual information from the low-resolution feature maps of the model. Figure 3(d) shows the specific structure of the LMFM. After the feature map is fed into the module, average pooling layers are used to mine more information, with progressively larger pooling kernels generating feature maps at 1/2, 1/4, and 1/8 of the input resolution. To utilize the information generated by GAP, we perform feature fusion by applying 3×3 and 1×1 convolution kernels multiple times. The fusion adopts the interval connection shown in Figure 3(d) and concatenates the generated feature maps. Compared with the connection method of Res2Net, this method extracts context information better. In addition, inspired by the design of the residual network, we add a 1×1 convolution as a shortcut connection to prevent the loss of shallow information.

Inside the LMFM, the feature map extracted by the smaller pooling kernel is processed at a deeper level to obtain deeper information, combined with the shallower information received by the larger pooling kernel. Compared with the connection method of DAPPM, LMFM can combine information more effectively to form multi-scale attributes at different levels. Adding this module to the low-resolution feature map can obtain richer contextual information without affecting the inference speed of the model.
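The pooling side of this idea can be illustrated with a minimal sketch. It only shows average pooling at three scales followed by upsampling and stacking. The 3×3 and 1×1 convolutions and the interval connections of Figure 3(d) are left out, and the function names, the nearest-neighbour upsampling, and the single-channel input are assumptions made for illustration rather than the LMFM itself.

```python
def avg_pool(fmap, k):
    """Non-overlapping k x k average pooling on a 2D list (H and W divisible by k)."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            sum(fmap[i + di][j + dj] for di in range(k) for dj in range(k)) / (k * k)
            for j in range(0, w, k)
        ]
        for i in range(0, h, k)
    ]


def upsample_nearest(fmap, k):
    """Repeat every value k times along both axes (assumed upsampling choice)."""
    return [[v for v in row for _ in range(k)] for row in fmap for _ in range(k)]


def multi_scale_stack(fmap, kernels=(2, 4, 8)):
    """Return the input plus its 1/2, 1/4 and 1/8 pooled versions, all at input size."""
    branches = [fmap]  # the untouched input stands in for the shortcut path
    for k in kernels:
        branches.append(upsample_nearest(avg_pool(fmap, k), k))
    return branches


# Toy 8 x 8 single-channel feature map with values 0..63.
fmap = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
stack = multi_scale_stack(fmap)
print(len(stack), len(stack[1]), len(stack[1][0]))
```

Each entry of `stack` could then be treated as one channel group to concatenate; larger kernels summarise wider context, while the first entry keeps the shallow information.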

Figure 3: Comparison of different convolution blocks. (a) is bottleneck[7], (b) is GCN[16], (c) is DAPPM[8], and (d) is our proposed LMFM.

2.3 Boundary Corrected Loss

Semantic segmentation aims to obtain labels for all target image pixels. We believe that boundary pixels are more likely to generate hard samples for semantic segmentation. The model’s learning of boundaries can be enhanced by detecting hard samples. The current steps of online hard sample processing are as follows: 1) get feature maps generated by CNN; 2) calculate loss between obtained feature maps and ground truth; 3) evaluate the reliability of samples by calculated loss; 4) select samples with the larger loss as hard samples. However, this approach ignores the influence of boundaries on feature maps. In addition, boundary smoothing caused by multi-scale feature fusion is widespread.
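The four steps of online hard sample processing listed above can be sketched as follows. This is a minimal illustration that assumes the network output has already been converted to per-pixel class probabilities; `select_hard_samples` and `keep_ratio` are hypothetical names, not part of any library.

```python
import math


def pixel_cross_entropy(probs, label):
    """Cross-entropy of one pixel: -log of the probability given to the true class."""
    return -math.log(max(probs[label], 1e-12))


def select_hard_samples(prob_map, labels, keep_ratio=0.25):
    """Return (loss, pixel index) pairs for the pixels with the largest loss."""
    losses = [
        (pixel_cross_entropy(p, y), i)
        for i, (p, y) in enumerate(zip(prob_map, labels))
    ]
    losses.sort(reverse=True)  # largest loss first
    n_keep = max(1, int(len(losses) * keep_ratio))
    return losses[:n_keep]


# Four pixels, two classes; pixel 2 gives the true class only 0.3 probability.
prob_map = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
labels = [0, 0, 0, 0]
hard = select_hard_samples(prob_map, labels, keep_ratio=0.25)
print(hard)
```

The selection depends only on the loss value of each pixel, which is why it carries no information about where the boundaries are.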

To address these problems, we propose a simple and effective flip boundary corrected loss function. This function captures long-distance relationships in isolated regions by flipping entire rows or columns of pixels and detects boundary-hard samples online. This method can correct the boundary smoothing caused by low-resolution feature fusion. We believe that the higher the loss, the greater the probability that the position is a boundary-hard sample. Specifically, the algorithm flow is shown in Stage III of Figure 2.


The process of BCL is as follows: 1) use the softmax layer to process the feature map; 2) apply row and column flips to the ground truth, flipping adjacent pixels according to the set step size; 3) calculate the cross-entropy loss between the original feature map and the original ground truth, and the cross-entropy loss between the feature map and the ground truth transformed by rows and columns; 4) use Non-Maximum Suppression to filter the obtained losses, sort them from largest to smallest, and select the boundary-hard samples according to the input threshold; 5) calculate the weighted sum of the selected sample loss and the original cross-entropy loss. BCL can be expressed as:


where $p$ represents the predicted result for a pixel, $y$ and $\hat{y}$ are the ground truth before and after flipping, $y_r$ and $y_c$ represent the ground truth after the row and column transformations, $\alpha$ and $\beta$ represent the weighting coefficients, $C$ represents the number of categories, and $N$ is the batch size. To balance the speed and efficiency of the model under a limited computational budget, we adopt different moving steps and thresholds to improve the representation ability of the network and reduce the computational complexity.
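The five steps of the BCL procedure can be read in more than one way, so the following is only a sketch of one possible reading. It assumes that a "flip" swaps neighbouring blocks of `step` rows or columns of the ground truth, that the flipped-label loss is used to locate boundary-hard pixels, and that the final value is a weighted sum of the mean cross-entropy and the cross-entropy of the selected pixels. The Non-Maximum Suppression filter is omitted, and all function and parameter names (`swap_adjacent`, `keep_ratio`, `alpha`, `beta`) are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over the class scores of one pixel."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-probability assigned to `label` (clamped to avoid log(0))."""
    return -math.log(max(probs[label], 1e-12))


def swap_adjacent(n, step):
    """Index map that swaps neighbouring blocks of `step` positions (assumed 'flip')."""
    idx = list(range(n))
    for start in range(0, n - 2 * step + 1, 2 * step):
        for k in range(step):
            a, b = start + k, start + step + k
            idx[a], idx[b] = idx[b], idx[a]
    return idx


def boundary_corrected_loss(logits, labels, step=1, keep_ratio=0.5, alpha=1.0, beta=1.0):
    """Return (loss, indices of selected boundary-hard pixels) for one H x W label map."""
    h, w = len(labels), len(labels[0])
    rows, cols = swap_adjacent(h, step), swap_adjacent(w, step)  # step 2
    ce, flipped = [], []
    for i in range(h):
        for j in range(w):
            p = softmax(logits[i][j])  # step 1
            ce.append(cross_entropy(p, labels[i][j]))  # step 3, original labels
            flipped.append(
                cross_entropy(p, labels[rows[i]][j])  # step 3, row-flipped labels
                + cross_entropy(p, labels[i][cols[j]])  # step 3, column-flipped labels
            )
    # Step 4 (NMS omitted): keep the pixels with the largest flipped-label loss.
    n_keep = max(1, int(len(ce) * keep_ratio))
    hard = sorted(range(len(ce)), key=lambda k: flipped[k], reverse=True)[:n_keep]
    # Step 5 (assumed form): weighted sum of the mean loss and the hard-pixel loss.
    base = sum(ce) / len(ce)
    hard_term = sum(ce[k] for k in hard) / n_keep
    return alpha * base + beta * hard_term, hard


# 4 x 4 toy map: column 0 is class 0, columns 1..3 are class 1, predictions are correct.
labels = [[0, 1, 1, 1] for _ in range(4)]
logits = [[[4.0, 0.0] if c == 0 else [0.0, 4.0] for c in row] for row in labels]
loss, hard = boundary_corrected_loss(logits, labels)
print(sorted(hard))  # pixel indices (row * 4 + col) next to the class boundary
```

In this toy case the selected pixels are exactly those in the two columns on either side of the class boundary. In this sketch, a boundary that falls exactly between two swap blocks is not detected for that step size, so combining several step sizes would be one way to cover such cases.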

Model Roa Sid Bui Wal Fen Pol TLi TSi Veg Ter Sky Ped Rid Car Tru Bus Tra Mot Bic mIoU
SegNet [1] 96.4 73.2 84.0 28.4 29.0 35.7 39.8 45.1 87.0 63.8 91.8 62.8 42.8 89.3 38.1 43.1 44.1 35.8 51.9 57.0
ENet [15] 96.3 74.2 75.0 32.2 33.2 43.4 34.1 44.0 88.6 61.4 90.6 65.5 38.4 90.6 36.9 50.5 48.1 38.8 55.4 58.3
ICNet [23] 97.1 79.2 89.7 43.2 48.9 61.5 60.4 63.4 91.5 68.3 93.5 74.6 56.1 92.6 51.3 72.7 51.3 53.6 70.5 69.5
LEDNet [20] 98.1 79.5 91.6 47.7 49.9 62.8 61.3 72.8 92.6 61.2 94.9 76.2 53.7 90.9 64.4 64.0 52.7 44.4 71.6 70.6
DDRNet-s [8] 98.1 84.4 92.0 53.3 59.4 61.2 68.8 76.3 92.1 65.2 94.1 80.1 60.9 94.8 84.3 88.3 76.8 61.2 75.4 77.2
BCMFNet (Ours) 98.2 84.6 92.2 54.0 61.3 64.8 71.4 78.1 92.7 69.8 94.9 81.7 64.0 95.1 81.4 87.9 79.0 60.0 75.8 78.2
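Table 1: Individual category results and mIoU scores (%) on the Cityscapes dataset. Column abbreviations: Roa = road, Sid = sidewalk, Bui = building, Wal = wall, Fen = fence, Pol = pole, TLi = traffic light, TSi = traffic sign, Veg = vegetation, Ter = terrain, Sky = sky, Ped = pedestrian, Rid = rider, Car = car, Tru = truck, Bus = bus, Tra = train, Mot = motorcycle, Bic = bicycle.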

3 Experiments

3.1 Datasets and Metrics

Our method is evaluated on the public Cityscapes [6] and CamVid [5] datasets. Cityscapes is a dataset of urban landscapes, which contains a total of 5000 images. 2975 images are used for the training set, 500 images are used for the validation set, and 1525 images are used for the test set. The dataset includes 19 categories, and the resolution of the images is 2048×1024. CamVid contains a total of 701 images, with 367 images for the training set, 101 images for the validation set, and 233 images for the test set. There are 32 categories in total. 11 categories are used for the semantic segmentation task, and the resolution of the images is 960×720.

In experiments, we employ mean intersection over union (mIoU) to evaluate model accuracy and use FPS, GFLOPs, and Params to measure model efficiency.
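For reference, mIoU averages the per-class ratio of intersection to union between predicted and ground-truth labels. The sketch below computes it on flat label lists. The function name is illustrative, and benchmark scores are typically accumulated over a whole dataset rather than computed per image as done here.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection over union across the classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both label maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Six pixels, three classes: per-class IoU is 1/3, 2/3 and 1/2.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```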

Model mIoU(%) FPS GFLOPs Params
SegNet [1] 57 16.7 286 29.5M
ENet [15] 57 135.4 3.8 0.4M
BiSeNet1 [22] 68.4 105.8 14.8 5.8M
BiSeNetV2 [21] 72.6 156 21.1 -
ICNet [23] 69.5 30.0 - 7.8M
LEDNet [20] 70.6 71.0 - 0.94M
CABiNet [10] 75.9 76.5 12.0 2.64M
DDRNet-s [8] 77.4 101.6 36.3 5.7M
BCMFNet (Ours) 78.2 103.4 35.8 5.6M
Table 2: Accuracy and speed comparison on the Cityscapes dataset.

Model mIoU(%) FPS
DFANet [12] 64.7 120
BiSeNet1 [22] 65.6 175
BiSeNetV2 [21] 72.4 124
MSFNet [18] 75.4 91
DDRNet-s [8] 74.7 230
BCMFNet (Ours) 76.2 225
Table 3: Accuracy and speed comparison on the CamVid dataset.

Index LMFM BCL mIoU(%)
1 - - 76.8
2 ✓ - 77.6
3 ✓ ✓ 78.2
Table 4: Comparison results of the proposed modules on the Cityscapes dataset.

3.2 Implementation Details

The structure of the model is described in Section 2. For the Cityscapes dataset, we randomly crop the input to 1024×1024. We use the SGD optimizer with a momentum of 0.9 and a weight decay of 0.0005. We train models with a batch size of 8 and an initial learning rate of 0.1. For the CamVid dataset, we randomly crop the input to 960×720 and set the batch size to 16; the rest of the settings are the same as for the Cityscapes dataset.
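The settings above can be collected in a small configuration, together with a helper that picks a random crop window. This is a sketch only: the dictionary keys and the `random_crop_box` helper are illustrative names, and anything not stated in the text (for example, a learning rate schedule) is left out.

```python
import random


def random_crop_box(width, height, crop_w, crop_h, rng=random):
    """Return (left, top, right, bottom) of a random crop window inside the image."""
    if crop_w > width or crop_h > height:
        raise ValueError("crop is larger than the image")
    left = rng.randint(0, width - crop_w)  # randint includes both end points
    top = rng.randint(0, height - crop_h)
    return left, top, left + crop_w, top + crop_h


# Values quoted in the text; the key names are hypothetical.
TRAIN_CONFIG = {
    "cityscapes": {"crop": (1024, 1024), "batch_size": 8},
    "camvid": {"crop": (960, 720), "batch_size": 16},
    "optimizer": {"name": "SGD", "momentum": 0.9, "weight_decay": 0.0005, "lr": 0.1},
}

# Cityscapes images are 2048 x 1024, so only the horizontal offset varies.
box = random_crop_box(2048, 1024, *TRAIN_CONFIG["cityscapes"]["crop"])
print(box)
```

For CamVid the crop size equals the image size of 960×720, so the window always covers the whole image.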

3.3 Results

In experiments, we compare our method with state-of-the-art real-time semantic segmentation methods. Results on the Cityscapes dataset are shown in Table 2. Our method achieves the best balance between real-time speed and high accuracy and is faster than the baseline at inference. As can be seen from the per-category results in Table 1, our method achieves the best score in 16 of the 19 categories. Table 1 and Figure 1 show that our method has higher accuracy for categories with large intersection areas and well-defined boundaries, such as roads, fences, and traffic lights. Results on the CamVid dataset are shown in Table 3. Our results show a 1.5% improvement over the baseline while maintaining comparable inference speed. Figure 1 shows partial results on the CamVid dataset.
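The per-category count can be checked directly against the values in Table 1. The short script below copies the table rows (without the mIoU column) and counts the categories in which BCMFNet is at least as good as every other listed method; the variable names are illustrative.

```python
CLASSES = "Roa Sid Bui Wal Fen Pol TLi TSi Veg Ter Sky Ped Rid Car Tru Bus Tra Mot Bic".split()

# Per-category scores copied from Table 1 (mIoU column left out).
OTHERS = {
    "SegNet": [96.4, 73.2, 84.0, 28.4, 29.0, 35.7, 39.8, 45.1, 87.0, 63.8,
               91.8, 62.8, 42.8, 89.3, 38.1, 43.1, 44.1, 35.8, 51.9],
    "ENet": [96.3, 74.2, 75.0, 32.2, 33.2, 43.4, 34.1, 44.0, 88.6, 61.4,
             90.6, 65.5, 38.4, 90.6, 36.9, 50.5, 48.1, 38.8, 55.4],
    "ICNet": [97.1, 79.2, 89.7, 43.2, 48.9, 61.5, 60.4, 63.4, 91.5, 68.3,
              93.5, 74.6, 56.1, 92.6, 51.3, 72.7, 51.3, 53.6, 70.5],
    "LEDNet": [98.1, 79.5, 91.6, 47.7, 49.9, 62.8, 61.3, 72.8, 92.6, 61.2,
               94.9, 76.2, 53.7, 90.9, 64.4, 64.0, 52.7, 44.4, 71.6],
    "DDRNet-s": [98.1, 84.4, 92.0, 53.3, 59.4, 61.2, 68.8, 76.3, 92.1, 65.2,
                 94.1, 80.1, 60.9, 94.8, 84.3, 88.3, 76.8, 61.2, 75.4],
}
OURS = [98.2, 84.6, 92.2, 54.0, 61.3, 64.8, 71.4, 78.1, 92.7, 69.8,
        94.9, 81.7, 64.0, 95.1, 81.4, 87.9, 79.0, 60.0, 75.8]

# Best score of any other method in each category.
best_other = [max(column) for column in zip(*OTHERS.values())]

best_or_tied = [c for c, ours, other in zip(CLASSES, OURS, best_other) if ours >= other]
strictly_best = [c for c, ours, other in zip(CLASSES, OURS, best_other) if ours > other]
behind = [c for c in CLASSES if c not in best_or_tied]

print(len(best_or_tied), len(strictly_best), behind)
```

With these values the count of 16 includes one tie (Sky, shared with LEDNet at 94.9), and the three categories where another method scores higher are truck, bus, and motorcycle.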

3.4 Ablation Study

In this section, we further analyze the effectiveness of the key components of BCMFNet. We conduct experiments on the Cityscapes dataset with a fixed backbone network, keeping all other conditions unchanged. As shown in Table 4, the baseline (Index 1) uses neither of our proposed LMFM and BCL components. Adding LMFM to the baseline brings a 0.8% improvement, and adding BCL brings a further 0.6% improvement. Using LMFM and BCL together achieves the best performance, raising the accuracy from 76.8% to 78.2% with little impact on the model’s inference speed. The results show the good complementarity of LMFM and BCL, which can jointly optimize the model.

4 Conclusion

In this paper, we address the real-time semantic segmentation of road scenes. In our BCMFNet, a new module, LMFM, is introduced, which connects the low-resolution multi-scale feature maps of the model to obtain richer contextual information while reducing the amount of computation. In addition, we propose the BCL to enhance the learning of hard samples by detecting boundary-hard samples online. The experimental results show that the proposed method performs better than existing models under the same conditions.


  • [1] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE TPAMI 39 (12), pp. 2481–2495. Cited by: §1, Table 1, Table 2.
  • [2] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2015) Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR (Poster), Cited by: §1.
  • [3] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE TPAMI 40 (4), pp. 834–848. Cited by: §1.
  • [4] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. CoRR abs/1706.05587. Cited by: §1.
  • [5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR, pp. 3213–3223. Cited by: §3.1.
  • [6] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR, pp. 3213–3223. Cited by: §3.1.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §1, Figure 3, §2.2.
  • [8] Y. Hong, H. Pan, W. Sun, and Y. Jia (2021) Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes. CoRR abs/2101.06085. Cited by: §1, Figure 3, §2.2, Table 1, Table 2, Table 3.
  • [9] P. Hu, F. Perazzi, F. C. Heilbron, O. Wang, Z. Lin, K. Saenko, and S. Sclaroff (2021) Real-time semantic segmentation with fast attention. IEEE Robotics Autom. Lett. 6 (1), pp. 263–270. Cited by: §1.
  • [10] S. Kumaar, Y. Lyu, F. Nex, and M. Y. Yang (2021) CABiNet: efficient context aggregation network for low-latency semantic segmentation. In ICRA, pp. 13517–13524. Cited by: §1, Table 2.
  • [11] F. Lateef and Y. Ruichek (2019) Survey on semantic segmentation using deep learning techniques. Neurocomputing 338, pp. 321–348. Cited by: §1.
  • [12] H. Li, P. Xiong, H. Fan, and J. Sun (2019) DFANet: deep feature aggregation for real-time semantic segmentation. In CVPR, pp. 9522–9531. Cited by: Table 3.
  • [13] X. Li, A. You, Z. Zhu, H. Zhao, M. Yang, K. Yang, S. Tan, and Y. Tong (2020) Semantic flow for fast and accurate scene parsing. In ECCV (1), Vol. 12346, pp. 775–793. Cited by: §1.
  • [14] M. Orsic, I. Kreso, P. Bevandic, and S. Segvic (2019) In defense of pre-trained imagenet architectures for real-time semantic segmentation of road-driving images. In CVPR, pp. 12607–12616. Cited by: §1.
  • [15] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello (2016) ENet: A deep neural network architecture for real-time semantic segmentation. CoRR abs/1606.02147. Cited by: §1, Table 1, Table 2.
  • [16] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun (2017) Large kernel matters - improve semantic segmentation by global convolutional network. In CVPR, pp. 1743–1751. Cited by: Figure 3, §2.2.
  • [17] E. Shelhamer, J. Long, and T. Darrell (2017) Fully convolutional networks for semantic segmentation. IEEE TPAMI 39 (4), pp. 640–651. Cited by: §1.
  • [18] H. Si, Z. Zhang, and F. Lu (2020) Real-time semantic segmentation via multiply spatial fusion network. In BMVC, Cited by: Table 3.
  • [19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In CVPR, pp. 1–9. Cited by: §2.2.
  • [20] Y. Wang, Q. Zhou, J. Liu, J. Xiong, G. Gao, X. Wu, and L. J. Latecki (2019) Lednet: A lightweight encoder-decoder network for real-time semantic segmentation. In ICIP, pp. 1860–1864. Cited by: Table 1, Table 2.
  • [21] C. Yu, C. Gao, J. Wang, G. Yu, C. Shen, and N. Sang (2021) BiSeNet V2: bilateral network with guided aggregation for real-time semantic segmentation. Int. J. Comput. Vis. 129 (11), pp. 3051–3068. Cited by: §1, Table 2, Table 3.
  • [22] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang (2018) BiSeNet: bilateral segmentation network for real-time semantic segmentation. In ECCV (13), Vol. 11217, pp. 334–349. Cited by: §1, §1, Table 2, Table 3.
  • [23] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia (2018) ICNet for real-time semantic segmentation on high-resolution images. In ECCV (3), Vol. 11207, pp. 418–434. Cited by: §1, Table 1, Table 2.