Semantic segmentation refers to the problem of estimating the class label of every pixel in an input image. It is a fundamental task for many computer vision applications. Thanks to the remarkable development of deep convolutional networks [10, 27, 8, 15, 12], substantial progress has been made on this task. However, many algorithms require sophisticated models that come with large memory and computational cost. This is challenging for robotics applications, where real-time performance is required and computational resources are usually limited. A better latency-accuracy trade-off is therefore worth investigating.
Current deep-learning-based semantic segmentation algorithms usually contain a front-end network and a back-end network. A backbone network pre-trained on a large-scale image classification task is usually employed as the front-end for feature extraction. The back-end network, on the other hand, usually contains a module for multi-scale context aggregation (e.g., ASPP [2, 4], PPM [40], RefineNet [17]), as well as some sequential convolution layers that generate the final dense class probability map. Since the front-end network needs to extract high-level semantic information, it usually requires a deep network with a large number of parameters. To accelerate the algorithm, current real-time segmentation methods employ two-branch or multi-branch architectures [33, 39, 37] for the front-end. In particular, the branch with the deeper network is fed a lower-resolution version of the image, while the branch with the shallower network deals with the higher-resolution version. Other approaches achieve real-time performance by designing thin network structures [25, 28]. However, the accuracy of the segmentation result usually degrades substantially with these acceleration techniques, so a better trade-off between accuracy and speed still requires investigation. Fig. 1 shows an overview of the speed and accuracy of some current real-time semantic segmentation methods.
Since an image may contain the same class at different scales, leveraging context information is important, especially for thin and small objects. Multi-scale context aggregation with deep networks therefore plays an important role in semantic segmentation, and it is usually the back-end network that accomplishes this. Current context aggregation modules, such as Atrous Spatial Pyramid Pooling (ASPP) [2, 4], have shown their effectiveness and rank among the state-of-the-art methods. However, these modules usually work in deep layers of the network, where the number of feature map channels is large. In this case, even a convolutional layer with kernel size 3 consumes substantial computation. In this work, we design a factorized ASPP module for multi-scale context fusion. Moreover, since this module is light-weight, we can employ it repeatedly for stronger context fusion without an obvious increase in computation. We name this module cascaded factorized ASPP (CF-ASPP).
Another issue of the back-end network for semantic segmentation is that the spatial size of the feature maps decreases substantially after the front-end. In addition, many approaches improve the speed by directly using lower-resolution images as input, which makes the task of the back-end network more challenging. Many current back-end networks simply up-sample the feature map with parameter-free operations, such as bilinear interpolation, to obtain a result at the original resolution. In this case, it is hard to recover the details of the final segmentation result. Mazzini [22] proposes the Guided Up-sampling Network (GUN) to better recover from the low-resolution feature map. In particular, the network learns to predict a high-resolution guidance table of offset vectors that steer sampling towards the correct semantic class. In this work, we solve this problem from another perspective. In the training process, we use a lower-resolution input image (e.g., a down-sampled version of the $1024 \times 2048$ Cityscapes image) but keep the high-resolution ground truth for supervision. In effect, this is the problem of recovering a high-resolution output from a low-resolution input, which has been widely studied under the topic of super-resolution [7]. We can leverage the development of super-resolution to solve this problem. This process can be performed by current highly optimized operators, which we fuse with the proposed CF-ASPP.
To this end, we propose a new back-end network for semantic segmentation. In particular, we propose the CF-ASPP module to efficiently perform multi-scale context aggregation, and we employ a feature map super-resolution step that better recovers a high-resolution result from a low-resolution input. Our whole pipeline is shown in Fig. 2. To summarize, our contributions are three-fold:
We treat the problem of recovering a high-resolution segmentation result from a low-resolution input as a super-resolution process. The experimental results show that, given a lower-resolution input image, the performance degradation of our method is smaller than that of other methods. This helps to accelerate the algorithm while keeping reasonable accuracy.
We provide a new back-end network for semantic segmentation. The proposed network provides a better latency-accuracy trade-off than current state-of-the-art real-time semantic segmentation methods [25, 39, 37]. In addition, the proposed back-end network is easy to implement and can be directly combined with other existing feature extraction networks. This means that our approach can benefit from advances in general feature extraction networks, which is another important computer vision topic [29, 38, 23].
II Related Work
II-A Quality Driven Semantic Segmentation
Since the introduction of FCN [19], there has been extensive research on deep learning for semantic segmentation. How to model the spatial relationship between pixels and how to leverage context for inference are the main concerns of current methods. For example, the skip-connection [10] is widely employed to fuse high-level semantic features and low-level spatial cues; ASPP [2, 3, 4] and PPM [40] are components applied to the features extracted by the front-end to fuse context at multiple scales; CRF [1] and MRF [18] are used to model the spatial relationship between pixels or regions. Recently, HRNet [33] was proposed for semantic segmentation. The whole network maintains multiple branches with different resolutions of the image, and the representations at different resolutions are densely fused. These methods achieve high-quality results at the cost of heavy computation.
II-B Real-Time Semantic Segmentation
Recently, real-time semantic segmentation has attracted more and more researchers. Research along this line aims to reduce the model inference time while keeping decent accuracy. ENet [24] is one of the pioneers in this line. The authors design a light-weight network structure, and the input image is heavily down-sampled at an early stage of the network to reduce processing time. ERFNet [28] takes another approach, where the network contains multiple factorized convolution blocks. In particular, a combination of $3 \times 1$ and $1 \times 3$ convolutions is used to reduce the computation of the original $3 \times 3$ convolution. A similar structure is also used in [35]. Two-branch and multi-branch network designs have also been proposed. ICNet [39], ContextNet [26], BiSeNet [37] and Fast-SCNN [25] learn global image context from a lower-resolution input with a deeper network branch. This context is then combined with the feature from a shallower branch that describes boundary information, computed either from a higher-resolution input or from a feature map obtained at an early stage of the deeper branch. Reusing features is another direction for reducing the computation cost. Most recently, Li et al. [16] design a network that contains multiple sub-modules, where features from an early module are reused by the following module. Different from these approaches, we treat the whole segmentation network as a combination of a front-end and a back-end, and our work does not need a specific backbone or feature-extraction network design for the front-end. We focus on how to design an efficient back-end network. In fact, the module proposed in this work can be plugged into other feature-extraction networks, such as MobileNet [29] and ShuffleNet [38].
II-C General Deep Network Acceleration
In addition to research on real-time networks for semantic segmentation, there are extensive studies on general deep network acceleration for applications such as object detection and image classification. For example, network quantization [14] is applied to the convolution parameters to obtain better inference speed than floating-point computation. Network compression techniques, on the other hand, either use a pruning operation [20] to reduce the network structure or use a bigger network to guide the training of a smaller network [11]. Recently, many light-weight backbone networks, such as MobileNet [29] and ShuffleNet [38], have also been proposed for efficient feature extraction with light-weight building blocks. Note that these research works are orthogonal to ours, and the proposed network can benefit from advances in these directions.
III-A Approach Overview
Given an input RGB image $I \in \mathbb{R}^{H \times W \times 3}$, a deep-learning-based semantic segmentation method usually consists of successive convolution layers, and the output is $P \in \mathbb{R}^{H \times W \times C}$, where $C$ denotes the number of classes of the segmentation task and $P$ is the class probability for each pixel. $H$ and $W$ denote the height and width of the image, respectively. Differently, in our formulation we allow a low-resolution input and a high-resolution output to accelerate the process, so we have $I \in \mathbb{R}^{h \times w \times 3}$ and $P \in \mathbb{R}^{H \times W \times C}$ with $h < H$ and $w < W$. The network contains a front-end and a back-end, as shown in Fig. 2. The front-end performs feature extraction, for which we can employ existing deep models such as VGG [32], ResNet [10], and MobileNet [29]. We can extract features from one or more intermediate layers of these networks. Since this part is not the main focus of this paper, we do not explain it in detail. For the back-end network, the input feature map is extracted from the front-end and the output is the probability map $P$. Here we employ the proposed cascaded factorized ASPP (see Sec. III-B), which is fused with feature space super-resolution (see Sec. III-C).
III-B Cascaded Factorized ASPP
Due to variations in object size, capturing and fusing image features at different scales is important for semantic segmentation. Atrous convolution [3] has been employed extensively for this purpose. Different from conventional convolution filters, an atrous convolution filter can be treated as inserting $r - 1$ zeros between two neighboring filter values along each spatial dimension, where $r$ is the atrous rate. Atrous convolution therefore allows us to enlarge the filter's field-of-view without increasing the number of model parameters or the computation cost. Inspired by the spatial pyramid pooling method of [9], Chen et al. [2] propose atrous spatial pyramid pooling (ASPP) for semantic segmentation. In ASPP, multiple parallel atrous convolutions with different atrous rates are applied on top of the feature map from a feature extraction network. The multiple atrous rates help the network to capture multi-scale context information. However, ASPP is applied to a feature map obtained from a deep network, where the channel number is usually large (e.g., 512 for ResNet-18 [10]). Even with a kernel size of $3 \times 3$, ASPP still consumes substantial computation. To alleviate this problem and further increase the effectiveness of ASPP, we make the following modifications.
Firstly, we decompose the atrous convolution layer into two layers: 1) a point-wise convolution layer (i.e., the kernel size is $1 \times 1$) that linearly combines the input channels and reduces the output channel dimension, which performs channel-wise information interaction; 2) a depth-wise atrous convolution with the same kernel size (i.e., $3 \times 3$) and atrous rate as the original atrous convolution, which enables the model to capture the features of the neighboring area. The depth-wise convolution is used here to reduce the computation cost. A similar factorization is also used in [29, 10]. Differently, depth-wise convolution is employed here instead of conventional convolution. This gives us the factorized ASPP (F-ASPP) module. Let $C_{in}$ and $C_{out}$ be the input and output channel numbers. For a feature map of spatial size $H_f \times W_f$, the computation complexity of the original atrous convolution is $O(9 H_f W_f C_{in} C_{out})$, while the computation complexity of the factorized one is $O(H_f W_f C_{in} C_{out}) + O(9 H_f W_f C_{out})$. Given that $C_{in}$ and $C_{out}$ are 512 and 256, respectively, in our implementation, the computation cost is reduced by a factor of about 8.8.
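The reduction factor quoted above can be checked with a few lines of arithmetic. The sketch below counts multiply-accumulate operations per spatial position only (the spatial size cancels in the ratio) and ignores biases, batch normalization and activations. The function names are illustrative and not part of our implementation.

```python
def atrous_conv_cost(c_in, c_out, k=3):
    """Multiply-accumulates per spatial position of a standard k x k (atrous) convolution."""
    return k * k * c_in * c_out


def factorized_cost(c_in, c_out, k=3):
    """Point-wise 1 x 1 convolution followed by a depth-wise k x k atrous convolution."""
    pointwise = c_in * c_out      # mixes channels and reduces c_in to c_out
    depthwise = k * k * c_out     # one k x k filter per output channel
    return pointwise + depthwise


c_in, c_out = 512, 256            # channel numbers stated in the text
original = atrous_conv_cost(c_in, c_out)
factorized = factorized_cost(c_in, c_out)
ratio = original / factorized
print(f"{original} vs {factorized}: {ratio:.1f}x")  # 1179648 vs 133376: 8.8x
```

The point-wise term dominates the factorized cost, which is why reducing the channel dimension in the first layer matters more than the choice of kernel size in the second.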
Secondly, instead of applying the ASPP module only once, we cascade two factorized ASPPs in our network. The motivation is that this allows more extensive multi-scale context aggregation. Since the factorized ASPP already reduces the computation considerably compared to the original ASPP, and since we use fewer channels in the second F-ASPP, cascading this component does not add much computation cost but improves the accuracy significantly (see our experiments). The detailed structure of the CF-ASPP is illustrated in Fig. 3.
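As a small illustration of the atrous convolution that underlies each F-ASPP branch, the following one-dimensional sketch checks that sampling the input with gaps of the atrous rate gives the same result as a conventional convolution whose filter has zeros inserted between its values. It is a toy example in plain Python with hypothetical helper names, not the layer used in the network.

```python
def conv1d_valid(signal, kernel):
    """Conventional 1-D correlation with stride 1 and no padding."""
    n = len(signal) - len(kernel) + 1
    return [sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
            for i in range(n)]


def atrous_conv1d(signal, kernel, rate):
    """1-D atrous correlation: filter taps are `rate` samples apart."""
    span = (len(kernel) - 1) * rate + 1   # field-of-view of the filter
    n = len(signal) - span + 1
    return [sum(signal[i + j * rate] * kernel[j] for j in range(len(kernel)))
            for i in range(n)]


def dilate_kernel(kernel, rate):
    """Insert rate - 1 zeros between neighboring filter values."""
    dilated = []
    for idx, weight in enumerate(kernel):
        dilated.append(weight)
        if idx < len(kernel) - 1:
            dilated.extend([0] * (rate - 1))
    return dilated


signal = [1, 2, 3, 4, 5, 6, 7, 8, 9]
kernel = [1, 0, -1]
rate = 2

atrous_out = atrous_conv1d(signal, kernel, rate)
dense_out = conv1d_valid(signal, dilate_kernel(kernel, rate))
print(atrous_out == dense_out)  # True
```

The atrous version touches only three input samples per output, while the zero-inserted filter spans five, which is how the field-of-view grows without extra parameters or multiplications.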
III-C Feature Space Super-resolution
In addition to kernel factorization, a simple way to reduce the computation cost is to reduce the input image resolution and up-sample the low-resolution result to a high-resolution one. However, recovering a high-resolution result from a low-resolution one is challenging. Another issue is that after several stages of the front-end network, the spatial size of the feature map also decreases considerably (e.g., to 1/8 or 1/16 of the input size). This is mainly because 1) the front-end network contains sub-sampling layers, and 2) the channel number in deep layers is large (e.g., 512, 1024), so keeping a high-resolution feature map in deep layers would bring too much computation cost. We therefore need to generate a high-resolution segmentation map from a low-resolution input. A parameter-free interpolation operation (e.g., bilinear up-sampling) may not be a good solution for this.
In this work, we treat this problem as a super-resolution process in the feature space. Firstly, in the training process, we use the down-sampled RGB image as input and the original high-resolution class label map as the ground truth. Secondly, for the network structure design, we gradually up-sample the feature map in the back-end network. The up-sampling operation is performed by sub-pixel convolution [30], which is widely used for the image super-resolution task. Here we explain how we apply sub-pixel convolution in our work. Given an input feature map $F \in \mathbb{R}^{H_f \times W_f \times C_f}$ and an up-sampling factor $r$, we need to generate an output feature map $F' \in \mathbb{R}^{r H_f \times r W_f \times C_f}$. Firstly, we apply a pixel-wise convolution layer to $F$ and generate a feature map of size $H_f \times W_f \times r^2 C_f$. Secondly, we apply a periodic shuffling operator that rearranges the elements of this $H_f \times W_f \times r^2 C_f$ feature map into a feature map of size $r H_f \times r W_f \times C_f$. A detailed definition of periodic shuffling can be found in [30]. It has been explained in [31] that, compared to the deconvolution operator with the same computation budget, sub-pixel convolution has more representation power. A detailed explanation of this is beyond the scope of this paper.
To this end, we have the sub-pixel convolution for feature map super-resolution. Here we apply this operation twice in the proposed CF-ASPP. Fig. 3 illustrates the network details.
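The periodic shuffling step can be sketched in a few lines of plain Python. The example assumes a channels-first nested list and the channel ordering used by PyTorch's `pixel_shuffle`; the ordering in the original sub-pixel convolution paper may differ, and the function name is illustrative only.

```python
def periodic_shuffle(x, r):
    """Rearrange a (C * r * r, H, W) nested list into (C, H * r, W * r).

    Assumed convention: out[c][h * r + i][w * r + j] = x[c * r * r + i * r + j][h][w].
    """
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    c_out = channels_in // (r * r)
    height, width = len(x[0]), len(x[0][0])
    out = [[[0] * (width * r) for _ in range(height * r)] for _ in range(c_out)]
    for c in range(c_out):
        for y in range(height * r):
            for col in range(width * r):
                src_channel = c * r * r + (y % r) * r + (col % r)
                out[c][y][col] = x[src_channel][y // r][col // r]
    return out


# Four channels of a 1 x 2 feature map become one channel of a 2 x 4 map.
feat = [[[0, 1]], [[10, 11]], [[20, 21]], [[30, 31]]]
up = periodic_shuffle(feat, 2)
print(up)  # [[[0, 10, 1, 11], [20, 30, 21, 31]]]
```

No arithmetic is performed in this step: every output value is copied from the low-resolution map, so all learned computation stays in the preceding convolution, which runs at low resolution.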
IV-A Evaluation Dataset
We perform the evaluation on the Cityscapes dataset [5], since it is a popular and standard benchmark for semantic segmentation research. The high-resolution input (i.e., a $1024 \times 2048$ RGB image) makes it challenging for real-time segmentation algorithms. We follow the official evaluation protocol of this benchmark. In particular, it contains an image collection with fine annotations of 30 common classes in urban street scenes (e.g., road, car, sky, person). The collection is divided into training, validation, and test splits with 2,975, 500, and 1,525 images, respectively. Images from this dataset are illustrated in Fig. 4. Among the 30 annotated classes, 19 are used for evaluation. The official accuracy criterion of the dataset is the mean intersection-over-union (IoU) metric. In particular, we have $\mathrm{IoU} = \frac{TP}{TP + FP + FN}$, where $TP$, $FP$, and $FN$ denote the numbers of true positive, false positive, and false negative pixels, respectively. For the inference speed criterion, we use a direct metric, the inference time. We do not use FLOPs because it is an indirect metric. It is usually not equivalent to the direct metric because it cannot reflect hardware-related factors such as memory access cost and degree of parallelism. This has been discussed in [21].
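A minimal sketch of the IoU computation on flattened label lists is given below. The ignore label value of 255 and the function names are assumptions made for illustration; the official benchmark scripts should be used for reported numbers.

```python
def per_class_iou(pred, target, num_classes, ignore_label=255):
    """Return IoU = TP / (TP + FP + FN) per class, or None if a class never occurs."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:      # assumed convention for unlabeled pixels
            continue
        if p == t:
            tp[t] += 1
        else:
            fp[p] += 1             # predicted class p where it is not present
            fn[t] += 1             # missed a pixel of the true class t
    ious = []
    for c in range(num_classes):
        denom = tp[c] + fp[c] + fn[c]
        ious.append(tp[c] / denom if denom > 0 else None)
    return ious


def mean_iou(ious):
    """Average over the classes that occur in the prediction or the ground truth."""
    valid = [v for v in ious if v is not None]
    return sum(valid) / len(valid)


pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
ious = per_class_iou(pred, target, num_classes=3)
print(ious, mean_iou(ious))
```

In this toy example class 0 has one true positive and one false positive, class 1 has two true positives and one false negative, and the last pixel is ignored.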
IV-B Implementation Details
For the software platform, the proposed algorithm is implemented in PyTorch 1.1 with cuDNN v7.0 (no TensorRT or other inference optimizers are used, to avoid external influence). For the hardware, the program runs on a PC with an Nvidia Titan X (Maxwell) GPU. This keeps the platform the same as that of many other current methods [39, 25, 26] for a fair comparison. When we evaluate the running time, we use a single CPU thread and a single GPU, and we measure the average model inference time on the 500 Cityscapes validation images.
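The averaging procedure can be sketched as follows. This is a generic timing harness in plain Python with a dummy workload, not our measurement script; the warm-up count and function names are assumptions. For a GPU model, `run_once` would also have to wait for the device to finish (for example by calling `torch.cuda.synchronize()`) before returning, otherwise only the kernel launch time is measured.

```python
import time


def average_inference_time_ms(run_once, inputs, warmup=2):
    """Average wall-clock time per input in milliseconds, after a short warm-up."""
    for x in inputs[:warmup]:          # warm-up runs are not timed
        run_once(x)
    start = time.perf_counter()
    for x in inputs:
        run_once(x)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(inputs)


def ms_to_fps(ms):
    """Convert a per-image latency in milliseconds to frames per second."""
    return 1000.0 / ms


def fps_to_ms(fps):
    """Convert frames per second to a per-image latency in milliseconds."""
    return 1000.0 / fps


def dummy_model(n):
    """Stand-in workload; a real test would run the network on one image."""
    return sum(i * i for i in range(n))


avg_ms = average_inference_time_ms(dummy_model, [10000] * 20)
print(f"{avg_ms:.3f} ms per input, {ms_to_fps(avg_ms):.1f} fps")
```

The two conversion helpers make it easy to compare results reported in milliseconds with results reported in frames per second.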
In the training process, for the front-end network, we use ResNet-18 [10] pre-trained on the ImageNet [6] classification task. For the high-level feature from the front-end network, we use the feature from the last convolution layer before the global average pooling layer. For the low-level feature, we use the feature from the layer ‘conv3_x’ of the network. To train the whole network, we use the 2,975 training images with fine annotations from the Cityscapes dataset. Batch normalization [12] and ReLU activation [15] are used after every convolution layer. Stochastic gradient descent (SGD) with momentum 0.9 and batch size 16 is used in the training process. The initial learning rate is set to 0.1 and decayed by a factor of 0.9 every 50 epochs. All experiments with the proposed method are trained for 400 epochs. Like other current methods, we also perform extensive data augmentation, including random flipping, rotation, color channel noise, and resizing, and we use the cross-entropy loss.
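The learning rate schedule described above can be written as a one-line step decay. The sketch assumes 0-indexed epochs and that the decay is applied exactly at multiples of 50 epochs; both are assumptions, as is the function name.

```python
def step_decay_lr(epoch, base_lr=0.1, gamma=0.9, step=50):
    """Learning rate at a given 0-indexed epoch: base_lr * gamma ** (epoch // step)."""
    return base_lr * gamma ** (epoch // step)


# Learning rate at the start of each 50-epoch stage of a 400-epoch run.
schedule = [step_decay_lr(e) for e in range(0, 400, 50)]
print([round(lr, 5) for lr in schedule])
```

Under these assumptions the rate is decayed seven times over 400 epochs, so training ends at roughly 48% of the initial learning rate.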
|Test ID||Method ID||input size||mIoU (%)||running time (ms)|
IV-C Ablation Study on Network Structure
To evaluate the proposed method, we first perform an ablation study using the following baseline methods:
1) Front-end network with ResNet-18 and back-end network with the original ASPP and the decoder from DeepLabV3+ [4].
2) Front-end network with ResNet-18. The back-end network contains one F-ASPP (without feature space super-resolution) and the decoder from DeepLabV3+ [4].
3) Front-end network with ResNet-18. The back-end network contains CF-ASPP (without feature space super-resolution) and no decoder from DeepLabV3+ [4].
4) The full proposed method.
From Tab. I, we can make the following observations by comparing different test cases. By comparing Test 1, 2, 3 and 4, we can see that the accuracy of the proposed back-end network is better than that of the original one from DeepLabV3+ [4]. In addition, the accuracy decreases when the input resolution decreases, but the degradation is smaller for the proposed method. This demonstrates the effectiveness of the proposed feature space super-resolution approach. Although our running time is larger than that of baseline method 1 for high-resolution input, the situation reverses for lower-resolution input (see Test 5 and 8). This is mainly because the additional computation of the proposed method is caused by the increased number of layers. Given a high-resolution input, the spatial size of the feature maps is large, and memory transfer between layers consumes a large part of the running time. Given a lower-resolution input, this overhead shrinks and the situation reverses. This also shows that our method is well suited to cases with a smaller input size.
By comparing Test 5, 6, 7, and 8, we can see that: 1) Incorporating F-ASPP improves the performance slightly, and the improvement is larger if we employ CF-ASPP. This demonstrates the effectiveness of F-ASPP and of cascading F-ASPP. 2) F-ASPP is faster than the original ASPP (Test 5 and 6). 3) With CF-ASPP, both the running time and the accuracy are better than those of the original ASPP (Test 5 and 7). 4) With feature space super-resolution, we obtain a better result (Test 7 and 8).
|Method||mIoU (%)||running time (fps)|
|Ours ( input)||70.2||68.5|
|Ours ( input)||68.4||84.0|
IV-D Comparison with Other Methods
We also compare the proposed method with other state-of-the-art real-time semantic segmentation algorithms, including ENet [24], ICNet [39], ContextNet [26], ESPNetV2 [23], GUN [22], Fast-SCNN [25], ERFNet [28], BiSeNet [37], LEDNet [35] and ThunderNet [36]. The results are listed in Tab. II. The results of the other methods are taken from the related literature. We can see that our speed is close to the fastest among the compared methods. Although some methods report higher FPS, they are evaluated on GPUs with higher performance. For example, the same network of [25] runs at 75.3 fps and 123.5 fps on Titan X (Maxwell) and Titan Xp, respectively (i.e., about 64% faster). In terms of accuracy, our method is also among the leading competitors. This shows that our method achieves a state-of-the-art latency-accuracy trade-off.
In addition, we test the inference time of the proposed network on a Jetson AGX Xavier embedded system and obtain 21.2 fps with an input image size of $512 \times 768$. Here we simply test an unoptimized implementation (we only use the Python interface of PyTorch and do not use the TensorRT library or the NVDLA engine for inference acceleration), so the running time can be further reduced by additional engineering.
IV-E Qualitative Analysis
Fig. 1 shows some of our results on the Cityscapes validation set. We can see that the proposed method can deal with some of the small and thin objects. From the 5th row, we can see that even though some cars are occluded, they can still be identified to some extent. This figure is best viewed zoomed in.
This paper proposes a back-end network component for real-time semantic segmentation. On one hand, we propose a cascaded factorized ASPP module for efficient multi-scale context aggregation. On the other hand, to allow the model to use a down-sampled lower-resolution image as input and generate a high-resolution segmentation output, we employ a super-resolution approach in the feature space. We conduct extensive experiments, and the effectiveness of these two aspects has been demonstrated. The resulting latency-accuracy trade-off is better than that of many existing state-of-the-art methods. A future research direction is to combine the proposed network with other network acceleration techniques such as network compression and quantization.
- [1] (2015) Semantic image segmentation with deep convolutional nets and fully connected CRFs. In International Conference on Learning Representations.
- [2] (2016) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848.
- [3] (2017) Rethinking atrous convolution for semantic image segmentation. CoRR abs/1706.05587.
- [4] (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision, pp. 808–818.
- [5] (2016) The Cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223.
- [6] (2009) ImageNet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- [7] (2016) Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2), pp. 295–307.
- [8] (2017) Mask R-CNN. In International Conference on Computer Vision, pp. 2961–2969.
- [9] (2014) Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pp. 1094–1916.
- [10] (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- [11] (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- [12] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456.
- [13] Cnn-benchmarks. Note: https://github.com/jcjohnson/cnn-benchmarks
- [14] (2019) Learning to quantize deep networks by optimizing quantization intervals with task loss. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4350–4359.
- [15] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- [16] (2019) DFANet: deep feature aggregation for real-time semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 9522–9531.
- [17] (2016) RefineNet: multi-path refinement networks for high-resolution semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1925–1934.
- [18] (2015) Semantic image segmentation via deep parsing network. In International Conference on Computer Vision, pp. 1377–1385.
- [19] (2015) Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
- [20] (2017) ThiNet: a filter level pruning method for deep neural network compression. In International Conference on Computer Vision, pp. 5058–5066.
- [21] (2018) ShuffleNet V2: practical guidelines for efficient CNN architecture design. CoRR abs/1807.11164.
- [22] (2018) Guided upsampling network for real-time semantic segmentation. In British Machine Vision Conference.
- [23] (2019) ESPNetv2: a light-weight, power efficient, and general purpose convolutional neural network. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 9190–9200.
- [24] (2016) ENet: a deep neural network architecture for real-time semantic segmentation. CoRR abs/1606.02147.
- [25] (2019) Fast-SCNN: fast semantic segmentation network. In British Machine Vision Conference.
- [26] (2018) ContextNet: exploring context and detail for semantic segmentation in real-time. In British Machine Vision Conference.
- [27] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
- [28] (2018) ERFNet: efficient residual factorized convnet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems 19 (1), pp. 263–272.
- [29] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
- [30] (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883.
- [31] (2016) Is the deconvolution layer the same as a convolutional layer?. CoRR abs/1609.07009.
- [32] (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
- [33] (2019) High-resolution representations for labeling pixels and regions. CoRR abs/1904.04514.
- [34] (2015) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (12), pp. 2481–2495.
- [35] (2019) LEDNet: a lightweight encoder-decoder network for real-time semantic segmentation. In IEEE International Conference on Image Processing, pp. 1860–1864.
- [36] (2019) ThunderNet: a turbo unified network for real-time semantic segmentation. In IEEE Winter Conference on Applications of Computer Vision, pp. 1789–1796.
- [37] (2018) BiSeNet: bilateral segmentation network for real-time semantic segmentation. In European Conference on Computer Vision, pp. 334–349.
- [38] (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856.
- [39] (2018) ICNet for real-time semantic segmentation on high-resolution images. In European Conference on Computer Vision, pp. 405–420.
- [40] (2016) Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890.