RGPNet: A Real-Time General Purpose Semantic Segmentation

12/03/2019 ∙ by Elahe Arani, et al.

We propose a novel real-time general purpose semantic segmentation architecture, called RGPNet, which achieves significant performance gain in complex environments. RGPNet consists of a light-weight asymmetric encoder-decoder and an adaptor. The adaptor helps preserve and refine the abstract concepts from multiple levels of distributed representations between encoder and decoder. It also facilitates the gradient flow from deeper layers to shallower layers. Our extensive experiments highlight the superior performance of RGPNet compared to the state-of-the-art semantic segmentation networks. Moreover, towards green AI, we show that using a modified label-relaxation technique with progressive resizing can reduce the training time by up to 60%. We also optimize RGPNet for resource-constrained and embedded devices, which increases the inference speed by 400%. RGPNet obtains a better speed-accuracy trade-off across multiple datasets.




1 Introduction

Figure 1: Schematic illustrations of common semantic segmentation architectures. (a) In context-based networks, dilated convolutions with multiple dilation rates are employed in cascade or in parallel to capture multi-scale context. (b) In encoder-decoder networks, the encoder extracts features with high-level semantic meaning and the decoder densifies the features learned by the encoder. (c) In attention-based networks, the feature at each position is selectively aggregated by a weighted sum of the features at all positions. This can be done across channels or spatial dimensions. (d) Multi-branch networks are employed to combine semantic segmentation results at multiple resolution levels. The lower resolution branches yield deeper features with reduced resolution and the higher resolution branches learn spatial details.

Convolutional neural networks (CNNs) have brought about a paradigm shift in the field of computer vision, leading to tremendous advances in many tasks [girshick2015fast, he2016deep, krizhevsky2012imagenet, lan2018person, li2017person, simonyan2014very, szegedy2015going]. Semantic segmentation, which associates each pixel to the object class it belongs to, is a computationally expensive task in computer vision [long2015fully]. Fast semantic segmentation is broadly applied to several real-time applications including autonomous driving, medical imaging and robotics [milioto2018real, paszke2016enet, salehi2018real, su2018real]. Accurate CNN-based semantic segmentation requires larger neural networks with deeper and wider layers. These larger networks are therefore not suitable for edge computing devices as they are cumbersome and require substantial resources.

Down-sampling operations, such as pooling and convolutions with stride greater than one, can help decrease the latency of deeper neural networks; however, they result in decreased pixel-level accuracy due to the lower resolutions at deeper levels. Many recent approaches employ either an encoder-decoder structure [unet, badrinarayanan2017segnet, sun2018fishnet], a two- or multi-branch architecture [poudel2019fastscnn, Yu_2018, zhao2018icnet] or dilated convolutions [chen2014semantic, chen2017deeplab, chen2017rethinking, Zhao_2017] to recover spatial information. While these real-time architectures perform adequately on simple datasets, their performance is sub-optimal for complex datasets possessing more variability in terms of classes, sizes, and shapes. Thus, there is significant interest in designing CNN architectures that can perform well on complex datasets and, at the same time, are mobile enough to be of practical use in real-time applications.

In this paper, we propose a real-time general purpose semantic segmentation network, RGPNet, that performs well on complex scenarios. RGPNet is based on an asymmetric encoder-decoder structure with a new module called adaptor in the middle. The adaptor utilizes features at different abstraction levels from both the encoder and decoder to improve the feature refinement at a given level, allowing the network to preserve deeper level features with higher spatial resolution. Furthermore, the adaptor enables a better gradient flow from deeper layers to shallower layers by adding shorter paths for the back-propagation. Since training an average deep learning model has a considerable carbon footprint [strubell2019energy], we reduce the training time, with negligible effect on performance, by applying progressive resizing during training.

Our main contributions are as follows:

  • We propose RGPNet as a general real-time semantic segmentation architecture that obtains deep features with high resolution resulting in improved accuracy and lower latency in a single branch network. It performs competitively in complex environments.

  • We introduce an adaptor module to capture multiple levels of abstraction to help in boundary refinement of segments. The adaptor also aids in gradient flow by adding shorter paths.

  • Towards green AI, we adopt a progressive resizing technique during training, which reduces the training time and the environmental impact. We combat the aliasing effect in the label map at lower resolutions by employing a modified label relaxation technique.

  • We optimize RGPNet for deployment on an edge computing device using TensorRT, a platform for high-performance deep learning inference, resulting in 400% increase in inference speed.

  • We report remarkable results on different datasets evaluated on single scale images. RGPNet achieves 80.9%, 69.2%, and 50.2% mIoU with Resnet-101 backbone and 74.1%, 66.9%, and 41.7% mIoU with Resnet-18 backbone on Cityscapes, CamVid and Mapillary, respectively. RGPNet obtains 37.4 FPS on an NVIDIA GTX2080Ti GPU on the Cityscapes dataset.

2 Related Work

Figure 2: Network schematic diagram of the proposed architecture, RGPNet. Rectangular boxes depict the tensors at a given level, labeled with their number of channels. Color-coded arrows represent the convolution operations indicated by the legend.

Semantic segmentation lies at the core of computer vision. With the advent of deep learning, long2015fully proposed the seminal fully convolutional network (FCN) with an end-to-end learning approach. However, FCN suffers from the loss of spatial details as it only utilizes high-level features from the last convolutional layer. Here, we summarize four widely used approaches that have been put forward to increase the feature resolution:

1) Context-based models: To capture contextual information at multiple scales, DeepLabV2 [chen2014semantic] and DeepLabV3 [chen2017deeplab] exploit multiple parallel atrous convolutions with different dilation rates, while PSPNet [Zhao_2017] performs multi-scale spatial pooling operations. Although these methods encode rich contextual information, they cannot capture boundary details effectively due to strided convolution or pooling operations [deeplabv3plus2018].

2) Encoder-decoder structure: Several studies employ an encoder-decoder structure [unet, badrinarayanan2017segnet, Pohlen_2017, zhuang2018shelfnet, li2018learning, ding2018context, fu2019stacked]. The encoder extracts global contextual information and the decoder recovers the spatial information. Deeplabv3+ [deeplabv3plus2018] utilizes an encoder to extract rich contextual information in conjunction with a decoder to retrieve the missing object boundary details. However, implementing dilated convolutions at higher dilation rates is computationally intensive, making such models unsuitable for real-time applications.

3) Attention-based models: Attention mechanisms, which help networks to focus on relevant information and ignore the irrelevant information, have been widely used in different tasks, and gained popularity to boost the performance of semantic segmentation. wang2018non formalized self-attention by calculating the correlation matrix between each spatial point in the feature maps in video sequences. To capture contextual information, DaNet [fu2019dual] and OCNet [yuan2018ocnet] apply a self-attention mechanism. DaNet has dual attention modules on position and channels to integrate local features with their respective global dependencies. OCNet, on the other hand, employs the self-attention mechanism to learn the object context map recording the similarities between all the pixels and the associated pixel. PSANet [zhao2018psanet] learns to aggregate contextual information for each individual position via a predicted attention map. Attention based models, however, generally require expensive computation.

4) Multi-Branch models: Another approach to preserve the spatial information is to employ a two- or multi-branch architecture. The deeper branches extract the contextual information by enlarging receptive fields, and the shallower branches retain the spatial details. The parallel structure of these networks makes them suitable for run-time-efficient implementations [Yu_2018, zhao2018icnet, poudel2019fastscnn]. However, they are mostly applicable to relatively simple datasets with fewer classes. At the other end of the spectrum, HRNet [Sun19hrnet] proposed a model with fully connected links between output maps of different resolutions. This allows the network to generalize better due to multiple paths, acting as ensembles. However, without reduction of the spatial dimensions of the features, the computational overhead is very high and makes the model no longer feasible for real-time usage.

Building on these observations, we propose a real-time general purpose semantic segmentation architecture that obtains deep features with high resolution resulting in improved accuracy and lower latency in a single branch encoder-decoder network.

3 Proposed Approach

3.1 Structure of RGPNet

RGPNet’s design is based on a light-weight asymmetric encoder-decoder structure for fast and efficient inference. It comprises three components: an encoder which extracts high-level semantic features, a light asymmetric decoder, and an adaptor which links different stages of encoder and decoder. The encoder decreases the resolution and increases the number of feature maps in the deeper layers; it thus extracts more abstract features in deeper layers with enlarged receptive fields. The decoder reconstructs the lost spatial information. The adaptor amalgamates the information from both encoder and decoder, allowing the network to preserve and refine the information between multiple levels.

The RGPNet architecture is depicted in Figure 2. In a given row of the diagram, all the tensors have the same spatial resolution, with the number of channels indicated in the scheme. Outputs are extracted from four levels of the encoder at different spatial resolutions, with 256, 512, 1024 and 2048 channels, respectively. At each level, the number of channels is reduced by a factor of four using convolutions followed by batch norm and a ReLU activation function. These outputs are then passed through a decoder structure with the adaptor in the middle. Finally, the segmentation output is extracted from the largest resolution via a convolution that matches the number of channels to the number of segmentation categories.

Figure 3: Adaptor module: the adaptor fuses information from multiple abstraction levels using transfer, downsampling and upsampling functions. The decoder block shares weights between layers.

Adaptor: The adaptor acts as a feature refinement module. Its presence removes the need for a symmetrical encoder-decoder structure. It aggregates features from three different levels and mediates between the encoder and the decoder (Figure 3). At each spatial level of the network, the adaptor combines encoder and decoder features through three functions: a transfer function, a downsampling function and an upsampling function. Downsampling is carried out by a convolution with stride 2 and upsampling by a deconvolution with stride 2, each matching the spatial resolution as well as the number of channels of the current level. The transfer function reduces the number of output channels of an encoder block and passes them to the adaptor; it consists of a convolution, parameterized by a weight matrix and a bias vector, followed by an activation function. The decoder contains a modified basic residual block in which weights are shared within the block.

The adaptor has a number of advantages. First, the adaptor aggregates features from different contextual and spatial levels. Second, it facilitates the flow of gradients from deeper layers to shallower layers by introducing a shorter path. Third, the adaptor allows for utilizing asymmetric design with light-weight decoder. This results in fewer convolution layers, further boosting the flow of gradients. The adaptor, therefore, makes the network suitable for real-time applications as it provides rich semantic information while preserving the spatial information.
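The shape bookkeeping behind the adaptor can be illustrated with a small sketch. This is not the authors' implementation: it tracks only (channels, height, width) tuples, and the helper names (`transfer`, `downsample`, `upsample`, `adaptor_input_shapes`, `encoder_shapes_for`) are hypothetical. It assumes encoder strides of 4, 8, 16 and 32 and a 512×1024 input, neither of which is stated above, and it assumes the decoder keeps the reduced channel width at each level. What it shows is that, once downsampling and upsampling match the resolution and channel count of the current level, the tensors reaching the adaptor have identical shapes and can be fused.

```python
# Shape-only sketch of the adaptor inputs. Apart from the encoder channel
# counts (256, 512, 1024, 2048) and the factor-4 channel reduction, every
# numeric choice here is an assumption made for illustration.

ENCODER_CHANNELS = [256, 512, 1024, 2048]  # from the text
ASSUMED_STRIDES = [4, 8, 16, 32]           # assumption: ResNet-style strides


def transfer(shape):
    """Transfer function: reduce channels by a factor of four, keep resolution."""
    c, h, w = shape
    return (c // 4, h, w)


def downsample(shape, out_channels):
    """Stride-2 convolution: halve the resolution, match the target channels."""
    _, h, w = shape
    return (out_channels, h // 2, w // 2)


def upsample(shape, out_channels):
    """Stride-2 deconvolution: double the resolution, match the target channels."""
    _, h, w = shape
    return (out_channels, h * 2, w * 2)


def encoder_shapes_for(height, width):
    """Shapes of the four encoder outputs for a given input size."""
    return [(c, height // s, width // s)
            for c, s in zip(ENCODER_CHANNELS, ASSUMED_STRIDES)]


def adaptor_input_shapes(encoder_shapes, level):
    """Shapes of the tensors that reach the adaptor at `level`."""
    current = transfer(encoder_shapes[level])
    inputs = [current]
    if level > 0:  # a higher-resolution encoder level exists
        higher = transfer(encoder_shapes[level - 1])
        inputs.append(downsample(higher, current[0]))
    if level < len(encoder_shapes) - 1:  # a deeper decoder level exists
        deeper = transfer(encoder_shapes[level + 1])  # assumed decoder width
        inputs.append(upsample(deeper, current[0]))
    return inputs


if __name__ == "__main__":
    shapes = encoder_shapes_for(512, 1024)  # hypothetical input size
    for level in range(len(shapes)):
        print(level, adaptor_input_shapes(shapes, level))
```

At the two intermediate levels three tensors arrive; at the shallowest and deepest levels only two do, since one neighbour is missing.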

3.2 Progressive Resizing with Label Relaxations

Progressive resizing is a technique commonly used in classification to reduce the training time. The training starts with smaller image sizes, followed by a progressive increase in size until the final stage of the training is conducted using the original image size. For instance, decreasing both image dimensions by a given factor shrinks the number of pixels by the square of that factor, so this technique can theoretically speed up each training epoch by the same amount, provided the batch size is increased correspondingly in a single iteration. However, reducing the size of the label map using nearest-neighbour interpolation (bi-linear and bi-cubic interpolation are not applicable to categorical labels) introduces noise around the borders of the objects due to aliasing. Note that inaccurate labelling is another source of noise. To reduce the effects of boundary artifacts in progressive resizing, inspired by zhu2018improving, we use an optimized variant of the label relaxation method.
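A small sketch may help make the two points above concrete: why label maps are resized by nearest-neighbour sampling, and where the theoretical per-epoch speedup comes from. The function names and the toy label map are hypothetical, and the sampling rule (keeping every `factor`-th pixel) is only one simple variant of nearest-neighbour resizing.

```python
def downsample_labels_nearest(label_map, factor):
    """Keep every `factor`-th pixel in both directions (nearest-neighbour sampling)."""
    return [row[::factor] for row in label_map[::factor]]


def theoretical_epoch_speedup(factor):
    """The pixel count drops by factor**2, which bounds the per-epoch speedup."""
    return factor ** 2


# Toy label map with class ids 0, 2 and 5 (hypothetical values).
label_map = [
    [0, 0, 2, 2],
    [0, 0, 2, 2],
    [5, 5, 2, 2],
    [5, 5, 2, 2],
]

small = downsample_labels_nearest(label_map, 2)

# Averaging class ids, as bi-linear interpolation would, invents a class:
# blending ids 0 and 2 yields 1.0, an id that does not occur in the map.
blended = (0 + 2) / 2

if __name__ == "__main__":
    print(small)
    print(theoretical_epoch_speedup(4))
```

Nearest-neighbour sampling only ever copies existing class ids, but which id survives near a border depends on the sampling grid, which is the aliasing noise that label relaxation is meant to absorb.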

In label relaxation along the borders, instead of maximizing likelihood of a target label, likelihood of union of neighbouring pixel labels is maximized. In our implementation, first one-hot labels are created from the label map followed by max-pool operation with stride

. This effectively dilates each one-hot label channel, transforming it into multi-hot labels along the borders, which can then be used to find the union of labels at the border pixels. The kernel size of the max pooling controls the width of the band of pixels treated as border pixels. Given the set N of border labels, maximizing the likelihood of their union amounts to minimizing the following loss at a given border pixel, where P(C) denotes the predicted probability of class C:

$$\mathcal{L}_{\text{boundary}} = -\log \sum_{C \in N} P(C)$$
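The label relaxation steps described above can be sketched in plain Python. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, the max-pool is assumed to use stride 1 with an odd kernel size so that the label map keeps its resolution, and the loss is assumed to be the negative log of the summed probabilities of the classes allowed at a pixel.

```python
import math


def one_hot(label_map, num_classes):
    """Return one binary map per class."""
    return [[[1 if pixel == c else 0 for pixel in row] for row in label_map]
            for c in range(num_classes)]


def max_pool_same(channel, kernel_size):
    """Stride-1 max-pool (assumed) that keeps the input size.

    Windows are clipped at the image border, which is equivalent to zero
    padding for 0/1 inputs. `kernel_size` is assumed to be odd.
    """
    h, w = len(channel), len(channel[0])
    r = kernel_size // 2
    return [[max(channel[yy][xx]
                 for yy in range(max(0, y - r), min(h, y + r + 1))
                 for xx in range(max(0, x - r), min(w, x + r + 1)))
             for x in range(w)]
            for y in range(h)]


def relaxed_labels(label_map, num_classes, kernel_size=3):
    """Dilate every one-hot channel, giving multi-hot labels near borders."""
    return [max_pool_same(channel, kernel_size)
            for channel in one_hot(label_map, num_classes)]


def relaxed_loss(probs, multi_hot, y, x):
    """Negative log of the total probability of the classes allowed at (y, x)."""
    allowed = sum(probs[c][y][x]
                  for c in range(len(probs)) if multi_hot[c][y][x])
    return -math.log(allowed)


if __name__ == "__main__":
    labels = [[0, 0, 1, 1]]  # toy 1x4 label map with two classes
    multi_hot = relaxed_labels(labels, num_classes=2)
    uniform = [[[0.5] * 4], [[0.5] * 4]]  # hypothetical predictions
    print(multi_hot)
    print(relaxed_loss(uniform, multi_hot, 0, 0))
    print(relaxed_loss(uniform, multi_hot, 0, 1))
```

Away from borders only one class is allowed and the loss reduces to ordinary cross entropy; at a border pixel any of the neighbouring classes is accepted, so an ambiguous prediction is not penalized.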

Figure 4: Semantic segmentation results on Mapillary Vistas validation set. The columns correspond to input image, the output of RGPNet, the output of TASCNet, the output of BiSeNet, and the ground-truth annotation. For all methods R101 is used as the backbone. RGPNet mainly improves the results on road and road-related objects’ pixels. Best viewed in color and with digital zoom.

4 Experimental Results

We conduct experiments on Mapillary [neuhold2017mapillary] as a highly complex dataset, CamVid [brostow2009semantic] and Cityscapes [Cityscapes_Cordts_2016] as moderately complex datasets.

Mapillary consists of high-resolution street-level images taken from many different locations around the globe and under varying conditions, with pixel-level category annotations. The dataset is split into a training set and a validation set.

CamVid consists of 701 low-resolution images in 11 classes, which are divided into 367/101/233 image sets for training, validation and testing, respectively. Here, we use the same experimental setup as SegNet [badrinarayanan2017segnet], including the image resolution for training and inference: 468 images for training and validation, and 233 images as the test set.

Cityscapes contains diverse street-level images from different cities across Germany and France. Only a subset of its classes is used for semantic segmentation evaluation. The dataset contains high-quality, finely annotated pixel-level images as well as coarsely annotated images. The finely annotated images are divided into image sets for training, validation and testing. We do not use the coarsely annotated data in our experiments.

We implement RGPNet in the PyTorch framework [paszke2017automatic]. For training on both datasets, we employ a polynomial learning rate policy, under which the initial learning rate is rescaled after each iteration, together with momentum and weight decay. We train our model with the synchronized batch-norm implementation provided by Zhang_2018_CVPR, on two Tesla V100 GPUs. For data augmentation, we apply random cropping and re-scaling, followed by random left-right flipping during training. The image base size differs between Mapillary and Cityscapes.
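A polynomial learning rate policy of the kind mentioned above is commonly written as the base rate multiplied by (1 - iteration / max_iterations) raised to a power. The sketch below uses that common form; the exponent of 0.9, the base rate of 0.01 and the 1000-iteration budget are assumptions chosen for illustration, not the values used in the experiments.

```python
def poly_learning_rate(base_lr, iteration, max_iterations, power=0.9):
    """Polynomial decay: base_lr * (1 - iteration / max_iterations) ** power.

    `power=0.9` is an assumed, commonly used default.
    """
    return base_lr * (1.0 - iteration / max_iterations) ** power


if __name__ == "__main__":
    base_lr = 0.01         # hypothetical base learning rate
    max_iterations = 1000  # hypothetical training length
    for iteration in (0, 250, 500, 750, 1000):
        print(iteration, poly_learning_rate(base_lr, iteration, max_iterations))
```

The rate starts at the base value, decreases monotonically and reaches zero at the final iteration.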

As a loss function, we use cross entropy with online hard example mining (OHEM) [wu2016high, yuan2018ocnet]. OHEM only keeps the sample pixels which are hard for the model to predict in a given iteration. The hard sample pixels are determined by a probability threshold for the corresponding target class: pixels whose predicted probability falls below the threshold are kept for training. To have enough representatives of each class in the mini-batch, a minimal pixel ratio is applied.
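The pixel selection rule of OHEM can be sketched as follows. This is a simplified illustration, not the implementation used in the experiments: `ohem_select` is a hypothetical helper, the threshold and ratio values are made up, and the minimum ratio is applied to the whole list of pixels rather than per class.

```python
def ohem_select(target_probs, threshold, min_kept_ratio):
    """Return the indices of the pixels kept for the loss.

    target_probs[i] is the predicted probability of pixel i's target class.
    Pixels below `threshold` are considered hard and kept. If too few
    qualify, the hardest pixels are kept until the minimum ratio is met.
    """
    n = len(target_probs)
    min_kept = max(1, int(n * min_kept_ratio))
    hardest_first = sorted(range(n), key=lambda i: target_probs[i])
    kept = [i for i in hardest_first if target_probs[i] < threshold]
    if len(kept) < min_kept:
        kept = hardest_first[:min_kept]
    return sorted(kept)


if __name__ == "__main__":
    probs = [0.9, 0.2, 0.95, 0.1]  # hypothetical target-class probabilities
    print(ohem_select(probs, threshold=0.7, min_kept_ratio=0.25))
    print(ohem_select(probs, threshold=0.05, min_kept_ratio=0.75))
```

In the second call no pixel falls below the threshold, so the minimum ratio takes over and the three hardest pixels are kept.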

4.1 Results on Mapillary

In this section, we evaluate and compare the overall performance of RGPNet with other real-time semantic segmentation methods (BiSeNet [Yu_2018], TASCNet [li2018learning], and ShelfNet [zhuang2018shelfnet]) on the Mapillary validation set. We use different feature extractor backbones: ResNet [He_2016_Resnet] (R101, R50 and R18), Wide-Resnet [Wider_or_Deeper_Wu_2019] (WRN38), and HarDNet [chao2019hardnet] (HarDNet39D).

Table 1 compares speed (FPS), mIoU and number of parameters of these methods using 16-bit precision computation. RGPNet(R101) achieves 50.2% mIoU, outperforming TASCNet and ShelfNet with lower latency. Although RGPNet(R101) has more parameters than TASCNet(R101), both its speed and mIoU are considerably higher. BiSeNet, in contrast, demonstrates poor performance on Mapillary, resulting in the lowest mIoU. Using TensorRT, RGPNet (R101 as the encoder) speeds up further on full image resolution (Table 7). Our method also achieves impressive results with a lighter encoder (R18 or HarDNet39D), significantly surpassing BiSeNet with a heavy backbone (R101): 41.7% vs 20.4% mIoU and 54.4 vs 15.5 FPS for R18. Finally, Figure 4 shows some qualitative results obtained by our model compared to TASCNet and BiSeNet.

Model(backbone) FPS mIoU(%) Params(M)
BiSeNet(R101) 15.5 20.4 50.1
TASCNet(R50) 17.6 46.4 32.8
TASCNet(R101) 13.9 48.8 51.8
ShelfNet(R101) 14.8 49.2 57.7
RGPNet(R101) 18.2 50.2 52.2
RGPNetB(WRN38) 5.72 53.1 215
RGPNet(HarDNet39D) 46.6 42.5 9.4
RGPNet(R18) 54.4 41.7 17.8
Table 1: Mapillary Vistas validation set results. The experiments are conducted using 16-bit floating point (FP16) numbers.

4.2 Results on Camvid

In Table 2, we compare the overall performance of RGPNet with other real-time semantic segmentation methods (SegNet, FCN [long2015fully], FC-DenseNet [jegou2017one], and FC-HarDNet [chao2019hardnet]) on the CamVid test set. We find that RGPNet with R18 and R101 backbones obtains 66.9% and 69.2% mIoU at 190 and 68.2 FPS, respectively. RGPNet achieves a significant increase in IoU for the Car, Traffic Sign, Pole, and Cyclist categories. Overall, we observe that our model outperforms the state-of-the-art real-time segmentation models.

Model Params(M) FPS Building Tree Sky Car Sign Road Pedestrian Fence Pole Sidewalk Cyclist mIoU(%) Pixel Acc.(%)
SegNet 29.5 63.0 68.7 52.0 87.0 58.5 13.4 86.2 25.3 17.9 16.0 60.5 24.8 46.4 62.5
FCN8 135 47.6 77.8 71.0 88.7 76.1 32.7 91.2 41.7 24.4 19.9 72.7 31.0 57.0 88.0
FC-DenseNet56 1.4 38.2 77.6 72.0 92.4 73.2 31.8 92.8 37.9 26.2 32.6 79.9 31.1 58.9 88.9
FC-DenseNet103 9.4 20.4 83.0 77.3 93.0 77.3 43.9 94.5 59.6 37.1 37.8 82.2 50.5 66.9 91.5
FC-HarDNet68 1.4 75.2 80.8 74.4 92.7 76.1 40.6 93.3 47.9 29.3 33.3 78.3 45.7 62.9 90.2
FC-HarDNet84 8.4 34.8 81.4 76.2 92.9 78.3 48.9 94.6 61.9 37.9 38.2 80.5 54.0 67.7 91.1
RGPNet(R18) 17.7 190 82.6 75.5 91.2 85.1 54.3 94.1 61.5 50.4 36.8 82.2 59.8 66.9 90.2
RGPNet(R101) 50.1 68.2 85.8 77.3 91.2 87.0 62.5 90.6 67.6 51.4 46.8 70.7 67.2 69.2 89.9
Table 2: CamVid test set results. The inference times are calculated on a single NVIDIA TitanV GPU with a single-image batch size.

4.3 Results on Cityscapes

Table 3 shows the comparison between our RGPNet and state-of-the-art real-time (BiSeNet, ICNet [zhao2018icnet], FastSCNN [poudel2019fastscnn], and ContextNet [poudel2018contextnet]) and offline (HRNet [Sun19hrnet] and Deeplabv3 [deeplabv3plus2018]) semantic segmentation methods on the Cityscapes validation set. RGPNet(R18) achieves 74.1% mIoU, which is slightly lower than the 74.8% of BiSeNet(R18). ICNet, ContextNet and FastSCNN achieve lower mIoU. Compared to the heavy offline segmentation methods, RGPNet(R101) is not only the fastest but also outperforms Deeplabv3 and BiSeNet(R101), and is comparable to HRNet.

We therefore conclude that RGPNet is a real-time general purpose semantic segmentation model that performs competitively across a wide spectrum of datasets compared to the state-of-the-art semantic segmentation networks designed for specific datasets.

Backbone Head mIoU-SS(%) mIoU-MS(%) FPS
R18 BiSeNet 74.8* 78.6* 40.4
PSPNet50 ICNet 67.7* - 40.6
N/A FastSCNN 68.1 - 43.5
N/A ContextNet 60.6 - 37.9
R18 RGPNet 74.1 76.4 37.8
W48 HRNet 81.1* - OOM
R101(OS-8) Deeplabv3 77.82* 79.30* 2.48
R101 BiSeNet - 80.3* 10.4
R101 RGPNet 80.9 81.9 10.9
Table 3: Cityscapes validation set result on full resolution image. Numbers with * are taken from respective paper. SS and MS denote single-scale and multi-scale. OOM stands for out-of-memory error. Numbers with

are computed in TensorFlow framework with our in-house implementations.

4.4 Progressive resizing with label relaxation

In order to validate the gain from label relaxation, we compare the results of progressive resizing training with and without label relaxation. In these experiments, the input images are downscaled in both width and height for the first 100 epochs. The resize factor is then increased at a later epoch, and full-sized images are used in the final stage. With label relaxation, we observe that the model achieves higher mIoU, especially at lower resolutions. To further analyze the effect of label relaxation in the progressive resizing technique, we illustrate the difference in entropy between the two setups (progressive resizing with and without label relaxation). Figure 5 shows that the model trained with label relaxation is more confident in its predictions around the object boundaries.

Figure 5: (a) Training profiles of progressive resizing with (red) and without (green) label relaxation, evaluated on the Cityscapes validation set. Especially at lower resolutions, label relaxation helps in achieving higher mIoU. (b) Heatmap of the difference in entropy between models trained with and without label relaxation, evaluated on a sample image from the validation set. On object boundaries, models trained with label relaxation are more confident about the label and hence have lower entropy (blue shades).

Green AI.

To examine the energy efficiency, we run two experiments, with and without the progressive resizing training technique, on a single GPU for 15 epochs. In the standard training experiment, we use full-size Cityscapes images. In the progressive resizing experiment, we start with a reduced image size and then scale up by a factor of 2 at two later epochs. Table 4 shows that the training time is reduced from 109 minutes to 32 minutes, close to the speedup expected from theoretical calculation. The energy consumed by the GPU decreases by approximately a factor of 4, with only a small drop in performance (78.8% vs 80.9% mIoU). Given this remarkable gain in energy efficiency, we suggest adopting progressive resizing with label relaxation for training semantic segmentation networks, as a step towards green AI.

Training Scheme Energy(KJ) Time mIoU(%)
Progressive resizing 212 31m 43s 78.8
Full scale 873 108m 44s 80.9
Table 4: Progressive resizing result on energy efficiency. mIoU reported here are from the complete experiment.
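The speedup arithmetic can be checked with a short sketch. The measured ratios use the values from Table 4. The schedule passed to `schedule_speedup` is hypothetical (three equal stages of five epochs at one quarter, one half and full size), because the actual switching epochs are not given above; the cost model assumes that the cost of an epoch is proportional to the number of pixels.

```python
def schedule_speedup(stages):
    """Theoretical speedup of a resizing schedule over full-size training.

    `stages` is a list of (epochs, scale) pairs, where `scale` is the
    fraction of the full width and height used during that stage.
    """
    total_epochs = sum(epochs for epochs, _ in stages)
    relative_cost = sum(epochs * scale ** 2 for epochs, scale in stages)
    return total_epochs / relative_cost


def to_seconds(minutes, seconds):
    return 60 * minutes + seconds


# Ratios measured in Table 4.
measured_time_ratio = to_seconds(108, 44) / to_seconds(31, 43)
measured_energy_ratio = 873 / 212

# Hypothetical 15-epoch schedule; the real switching epochs are not stated.
assumed_schedule = [(5, 0.25), (5, 0.5), (5, 1.0)]

if __name__ == "__main__":
    print(round(measured_time_ratio, 2))
    print(round(measured_energy_ratio, 2))
    print(round(schedule_speedup(assumed_schedule), 2))
```

The measured time and energy ratios come out at roughly 3.4 and 4.1. The theoretical figure depends on how many epochs are spent at each size, so spending more epochs at the smaller sizes than this assumed schedule does would raise it.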
Figure 6: Results obtained by RGPNet on the Cityscapes validation set. Top row: input image, ground-truth annotation, and label maps. Bottom row: the output of the PyTorch model, the TensorRT FP16 model, and the TensorRT FP32 model. The results show that TensorRT optimization in half and full precision floating point formats does not affect the qualitative outputs.

4.5 Ablation study

In this section, we perform an empirical evaluation of the structure of the adaptor module in our design. We remove the downsampling layers, which provide information from a higher-resolution encoder level to the adaptor. Table 5 shows that the performance of our model drops significantly, from 50.2% to 46.8% mIoU, on the Mapillary validation set. This indicates that the specific design of the adaptor plays an important role in feature preservation and refinement in our model.

Method mIoU(%)
RGPNet 50.2
RGPNet w/o downsampling connections 46.8
Table 5: Ablation study results on the Mapillary validation set: the effect of removing the adaptor's downsampling layers, which are shown in red in Figure 2.

We also perform an ablation study on common techniques used in the literature on the Cityscapes dataset. For Cityscapes training, we adopt a model pretrained on the Mapillary dataset by sorting the last-layer weights according to the mapping between Mapillary and Cityscapes categories. This results in a boost in mIoU.

CE OHEM PM MS+Flip mIoU(%)
Table 6: Ablation study on the Cityscapes validation set with RGPNet(R101). CE, OHEM and PM denote cross-entropy loss, online hard example mining loss, and pretrained model on Mapillary, respectively. MS+Flip stands for multi-scale evaluation with left/right image flip.

4.6 TensorRT

We use TensorRT optimization for RGPNet and evaluate on Nvidia GTX2080Ti and Xavier. RGPNet is evaluated on the Cityscapes validation set with half and full precision floating point formats. The inference speed results for different backbones and two input resolutions, using 16-bit and 32-bit floating point numbers, are reported in Table 7. RGPNet(R18) using TensorRT on full input resolution leads to a significant increase in speed, from 37.8 FPS to 153.4 FPS with 16-bit floating point operations. The speedup of FP16 compared to FP32 is noticeable for all backbones and both input resolutions. The results suggest that RGPNet can run at high speed on edge computing devices with little or negligible drop in accuracy. A real-world example is provided in Figure 6.

Backbone Nvidia GTX2080Ti Xavier
FP16 FP32 FP16 FP32 FP16 FP32 FP16 FP32
Resnet18 430.2 180.9 153.4 47.2 78.45 24.6 20.8 6.17
Resnet50 265.7 88.8 87.2 24.3 44.6 12.6 11.7 3.17
Resnet101 176.9 58.5 61.9 15.5 30.3 8.14 7.89 2.05
Table 7: RGPNet inference speed using TensorRT on Nvidia GTX2080Ti and Xavier.
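The FP16-over-FP32 gain can be read directly from Table 7, as the following sketch shows. The FPS values are copied from the table and from the text; the dictionary layout and function name are hypothetical. The four pairs per backbone simply follow the column order of the table (two GTX2080Ti settings, then two Xavier settings).

```python
# FPS values copied from Table 7; each tuple is (FP16, FP32).
TABLE_7 = {
    "Resnet18": [(430.2, 180.9), (153.4, 47.2), (78.45, 24.6), (20.8, 6.17)],
    "Resnet50": [(265.7, 88.8), (87.2, 24.3), (44.6, 12.6), (11.7, 3.17)],
    "Resnet101": [(176.9, 58.5), (61.9, 15.5), (30.3, 8.14), (7.89, 2.05)],
}


def fp16_speedups(table):
    """Ratio of FP16 to FP32 throughput for every backbone and setting."""
    return {backbone: [fp16 / fp32 for fp16, fp32 in pairs]
            for backbone, pairs in table.items()}


speedups = fp16_speedups(TABLE_7)

# Gain quoted in the text: 37.8 FPS before and 153.4 FPS after TensorRT (FP16).
tensorrt_gain = 153.4 / 37.8

if __name__ == "__main__":
    for backbone, ratios in speedups.items():
        print(backbone, [round(r, 2) for r in ratios])
    print(round(tensorrt_gain, 2))
```

Every entry shows FP16 running between two and four times faster than FP32, and the gain quoted in the text corresponds to roughly a fourfold increase in frame rate.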

5 Conclusion

In this paper, we proposed a real-time general purpose semantic segmentation network, RGPNet. It incorporates an adaptor module that aggregates features from different abstraction levels and coordinates between encoder and decoder resulting in better gradient flow. Our conceptually simple yet effective model achieves efficient inference speed and accuracy on resource constrained devices in a wide spectrum of complex domains. By employing a modified progressive resizing training scheme, we reduced training time by more than half with no drop in performance, thereby substantially decreasing the carbon footprint. Furthermore, our experiments demonstrate that RGPNet can generate segmentation results in real-time with comparable accuracy to the state-of-the-art non real-time models. This optimal balance of speed and accuracy makes our model suitable for real-time applications such as autonomous driving where the environment is highly dynamic due to the presence of high variability in real world scenarios.