Convolutional neural networks (CNNs) have brought about a paradigm shift in the field of computer vision, leading to tremendous advances in many tasks [girshick2015fast, he2016deep, krizhevsky2012imagenet, lan2018person, li2017person, simonyan2014very, szegedy2015going]. Semantic segmentation, which associates each pixel with the object class it belongs to, is a computationally expensive task in computer vision [long2015fully]. Fast semantic segmentation is broadly applied in several real-time applications including autonomous driving, medical imaging, and robotics [milioto2018real, paszke2016enet, salehi2018real, su2018real]. Accurate CNN-based semantic segmentation requires larger neural networks with deeper and wider layers. These larger networks are not suitable for edge computing devices, however, as they are cumbersome and require substantial resources.
Down-sampling operations, such as pooling and convolutions with stride greater than one, can help decrease the latency of deeper neural networks; however, they result in decreased pixel-level accuracy due to the lower resolutions at deeper levels. Many recent approaches employ either an encoder-decoder structure [unet, badrinarayanan2017segnet, sun2018fishnet], a two- or multi-branch architecture [poudel2019fastscnn, Yu_2018, zhao2018icnet], or dilated convolutions [chen2014semantic, chen2017deeplab, chen2017rethinking, Zhao_2017] to recover spatial information. While these real-time architectures perform appropriately on simple datasets, their performance is sub-optimal for complex datasets with more variability in terms of classes, sizes, and shapes. Thus, there is significant interest in designing CNN architectures that can perform well on complex datasets and, at the same time, are light enough to be of practical use in real-time applications.
In this paper, we propose a real-time general purpose semantic segmentation network, RGPNet, that performs well on complex scenarios. RGPNet is based on an asymmetric encoder-decoder structure with a new module called adaptor
in the middle. The adaptor utilizes features at different abstraction levels from both the encoder and decoder to improve feature refinement at a given level, allowing the network to preserve deeper-level features with higher spatial resolution. Furthermore, the adaptor enables better gradient flow from deeper to shallower layers by adding shorter paths for back-propagation. Since training an average deep learning model has a considerable carbon footprint [strubell2019energy], we reduce the training time, with negligible effect on performance, by applying progressive resizing during training.
Our main contributions are as follows:
We propose RGPNet as a general real-time semantic segmentation architecture that obtains deep features with high resolution resulting in improved accuracy and lower latency in a single branch network. It performs competitively in complex environments.
We introduce an adaptor module to capture multiple levels of abstraction to help in boundary refinement of segments. The adaptor also aids in gradient flow by adding shorter paths.
Towards green AI, we adopt a progressive resizing technique during training, which reduces the training time and the environmental impact. We combat aliasing effects in the label maps at lower resolutions by employing a modified label relaxation.
We optimize RGPNet for deployment on an edge computing device using TensorRT, a platform for high-performance deep learning inference, resulting in a 400% increase in inference speed.
We report remarkable results on different datasets evaluated on single-scale images. RGPNet achieves , , and mIoU with a ResNet-101 backbone and , , and mIoU with a ResNet-18 backbone on Cityscapes, CamVid, and Mapillary, respectively. For a resolution image, RGPNet obtains 37.4 FPS on an NVIDIA RTX 2080Ti GPU on the Cityscapes dataset.
2 Related Work
Semantic segmentation lies at the core of computer vision. With the advent of deep learning, long2015fully proposed the seminal fully convolutional network (FCN) with an end-to-end learning approach. However, FCN suffers from the loss of spatial details as it only utilizes high-level features from the last convolutional layer. Here, we summarize four widely used approaches that have been put forward to increase the feature resolution:
1) Context-based models: To capture contextual information at multiple scales, DeepLabV2 [chen2014semantic] and DeepLabV3 [chen2017deeplab] exploit multiple parallel atrous convolutions with different dilation rates, while PSPNet [Zhao_2017] performs multi-scale spatial pooling operations. Although these methods encode rich contextual information, they cannot capture boundary details effectively due to strided convolution or pooling operations [deeplabv3plus2018].
2) Encoder-decoder structure: Several studies employ an encoder-decoder structure [unet, badrinarayanan2017segnet, Pohlen_2017, zhuang2018shelfnet, li2018learning, ding2018context, fu2019stacked]. The encoder extracts global contextual information and the decoder recovers the spatial information. DeepLabv3+ [deeplabv3plus2018] utilizes an encoder to extract rich contextual information in conjunction with a decoder to retrieve the missing object boundary details. However, the implementation of dilated convolution at higher dilation rates is computationally intensive, making such models unsuitable for real-time applications.
3) Attention-based models: Attention mechanisms, which help networks focus on relevant information and ignore irrelevant information, have been widely used in different tasks and have gained popularity for boosting the performance of semantic segmentation. wang2018non formalized self-attention by calculating the correlation matrix between each pair of spatial points in the feature maps of video sequences. To capture contextual information, DANet [fu2019dual] and OCNet [yuan2018ocnet] apply a self-attention mechanism. DANet has dual attention modules on positions and channels to integrate local features with their respective global dependencies. OCNet, on the other hand, employs the self-attention mechanism to learn an object context map recording the similarities between each pixel and all other pixels. PSANet [zhao2018psanet] learns to aggregate contextual information for each individual position via a predicted attention map. Attention-based models, however, generally require expensive computation.
4) Multi-branch models: Another approach to preserving spatial information is to employ a two- or multi-branch design. The deeper branches extract contextual information by enlarging the receptive fields, while the shallower branches retain the spatial details. The parallel structure of these networks makes them suitable for runtime-efficient implementations [Yu_2018, zhao2018icnet, poudel2019_fastscnn]. However, they are mostly applicable to relatively simple datasets with a small number of classes. On the other hand, HRNet [Sun19hrnet] proposes a model with fully connected links between output maps of different resolutions. This allows the network to generalize better due to multiple paths, acting as ensembles. However, without a reduction of the spatial dimensions of the features, the computational overhead is very high, making the model infeasible for real-time usage.
Building on these observations, we propose a real-time general purpose semantic segmentation architecture that obtains deep features with high resolution resulting in improved accuracy and lower latency in a single branch encoder-decoder network.
3 Proposed Approach
3.1 Structure of RGPNet
RGPNet’s design is based on a light-weight asymmetric encoder-decoder structure for fast and efficient inference. It comprises three components: an encoder, which extracts high-level semantic features; a light asymmetric decoder; and an adaptor, which links different stages of the encoder and decoder. The encoder decreases the resolution and increases the number of feature maps in the deeper layers, thus extracting more abstract features with enlarged receptive fields. The decoder reconstructs the lost spatial information. The adaptor amalgamates information from both the encoder and decoder, allowing the network to preserve and refine information across multiple levels.
The RGPNet architecture is depicted in Figure 2. In a given row of the diagram, all tensors have the same spatial resolution, with the number of channels indicated in the scheme. Four level outputs are extracted from the encoder at different spatial resolutions , , and with 256, 512, 1024, and 2048 channels, respectively. The number of channels is reduced by a factor of four using convolutions followed by batch normalization and a ReLU activation at each level. These outputs are then passed through a decoder structure with the adaptor in the middle. Finally, the segmentation output is extracted from the largest resolution via a convolution matching the number of channels to the number of segmentation categories.
Adaptor: The adaptor acts as a feature refinement module. Its presence precludes the need for a symmetric encoder-decoder structure. It aggregates the features from three different levels and intermediates between the encoder and decoder (Figure 3). The adaptor function is:

$x^a_s = D(x^a_{s-1}) + T(x^e_s) + U(x^d_{s+1})$

where the superscripts $a$, $e$, and $d$ denote adaptor, encoder, and decoder, respectively, and $s$ represents the spatial level in the network. $D$ and $U$ are downsampling and upsampling functions. Downsampling is carried out by convolution with stride 2 and upsampling by deconvolution with stride 2, matching the spatial resolution as well as the number of channels at the current level. $T$ is a transfer function that reduces the number of output channels from an encoder block and transfers them to the adaptor:

$T(x^e_s) = \sigma(w_s \ast x^e_s + b_s)$

where $w_s$ and $b_s$ are the weight matrix and bias vector, $\ast$ denotes the convolution operation, and $\sigma$ denotes the activation function. The decoder contains a modified basic residual block, $R$, in which we use shared weights within the block. The decoder function is:

$x^d_s = R(x^a_s)$
The adaptor has a number of advantages. First, the adaptor aggregates features from different contextual and spatial levels. Second, it facilitates the flow of gradients from deeper layers to shallower layers by introducing a shorter path. Third, the adaptor allows for utilizing asymmetric design with light-weight decoder. This results in fewer convolution layers, further boosting the flow of gradients. The adaptor, therefore, makes the network suitable for real-time applications as it provides rich semantic information while preserving the spatial information.
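The adaptor equations above can be sketched in PyTorch as follows. This is a minimal illustration of our reading of the design, not the authors' released implementation; the channel arguments, the 3x3 kernel for $D$, the 2x2 kernel for $U$, and the 1x1 transfer convolution are assumptions:

```python
import torch
import torch.nn as nn

class Adaptor(nn.Module):
    """Sketch of the adaptor at level s: fuses a downsampled adaptor
    feature from level s-1, the channel-reduced encoder feature at
    level s, and an upsampled decoder feature from level s+1."""

    def __init__(self, ch_prev, ch_enc, ch, ch_next):
        super().__init__()
        # D: strided convolution halves the spatial resolution.
        self.down = nn.Conv2d(ch_prev, ch, kernel_size=3, stride=2, padding=1)
        # T: 1x1 convolution reduces encoder channels, with BN + ReLU.
        self.transfer = nn.Sequential(
            nn.Conv2d(ch_enc, ch, kernel_size=1),
            nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True))
        # U: strided deconvolution doubles the spatial resolution.
        self.up = nn.ConvTranspose2d(ch_next, ch, kernel_size=2, stride=2)

    def forward(self, x_prev, x_enc, x_dec_next):
        # x^a_s = D(x^a_{s-1}) + T(x^e_s) + U(x^d_{s+1})
        return self.down(x_prev) + self.transfer(x_enc) + self.up(x_dec_next)
```

With inputs at 32x32, 16x16, and 8x8 spatial resolutions, all three terms land at 16x16 with a shared channel count before summation.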
3.2 Progressive Resizing with Label Relaxation
Progressive resizing is a technique commonly used in classification to reduce the training time. The training starts with smaller image sizes, followed by a progressive increase of size until the final stage of the training is conducted with the original image size. If the image dimensions are decreased by a given factor (with the batch size correspondingly increased), this technique can theoretically speed up training per epoch by the square of that factor. However, reducing the size of the label map using nearest-neighbour interpolation (bi-linear or bi-cubic interpolation are not applicable to categorical label maps) introduces noise around the borders of objects due to aliasing. Note that inaccurate labelling is another source of noise. To reduce the effect of boundary artifacts in progressive resizing, inspired by zhu2018improving, we use an optimized variant of the label relaxation method.
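A minimal sketch of the per-stage resizing step described above, assuming bilinear interpolation for images and nearest-neighbour for label maps (the only valid choice for categorical maps); the function name and signature are illustrative:

```python
import torch
import torch.nn.functional as F

def resize_batch(images, labels, scale):
    """Resize a batch for one progressive-resizing stage.

    images: float tensor (B, C, H, W), interpolated bilinearly.
    labels: integer label map (B, H, W), interpolated with
            nearest-neighbour so class ids are never blended.
    """
    size = (int(images.shape[2] * scale), int(images.shape[3] * scale))
    images = F.interpolate(images, size=size, mode='bilinear',
                           align_corners=False)
    # Nearest-neighbour needs a channel dim and a float dtype.
    labels = F.interpolate(labels.unsqueeze(1).float(), size=size,
                           mode='nearest').squeeze(1).long()
    return images, labels
```

The nearest-neighbour path is exactly where the boundary aliasing discussed above originates, which motivates the label relaxation described next.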
In label relaxation along the borders, instead of maximizing the likelihood of a single target label, the likelihood of the union of neighbouring pixel labels is maximized. In our implementation, one-hot labels are first created from the label map, followed by a max-pool operation with stride 1. This effectively dilates each one-hot label channel, transforming it into a multi-hot label along the borders, which can then be used to find the union of labels at border pixels. The kernel size of the max pooling controls the width of the band of pixels treated as border pixels. The loss at a given border pixel, where $N$ is the set of border labels, can be calculated as:

$L_{border} = -\log \sum_{c \in N} P(c)$
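The border loss above can be sketched as follows; the function and its `kernel_size` default are illustrative, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def border_relaxation_loss(logits, labels, num_classes, kernel_size=3):
    """Label-relaxation loss sketch: along borders, maximize the
    likelihood of the union of neighbouring labels instead of a
    single target label. kernel_size controls the border width."""
    # One-hot encode the label map: (B, H, W) -> (B, C, H, W).
    one_hot = F.one_hot(labels, num_classes).permute(0, 3, 1, 2).float()
    # Max-pool with stride 1 dilates each one-hot channel, producing
    # multi-hot labels (the union set N) along class boundaries.
    border = F.max_pool2d(one_hot, kernel_size, stride=1,
                          padding=kernel_size // 2)
    # Probability mass of the union of labels N at each pixel.
    probs = F.softmax(logits, dim=1)
    union_prob = (probs * border).sum(dim=1).clamp(min=1e-8)
    # -log sum_{c in N} P(c), averaged over all pixels.
    return -torch.log(union_prob).mean()
```

Away from borders the union set collapses to the single target label, so the loss reduces to ordinary cross entropy there.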
4 Experimental Results
We conduct experiments on Mapillary [neuhold2017mapillary] as a highly complex dataset, CamVid [brostow2009semantic] and Cityscapes [Cityscapes_Cordts_2016] as moderately complex datasets.
Mapillary consists of high-resolution street-level images taken from many different locations around the globe and under varying conditions annotated for categories. The dataset is split up in a training set of images and a validation set of images.
CamVid consists of 701 low-resolution images in 11 classes, divided into 376/101/233 image sets for training, validation, and testing, respectively. Here, we use the same experimental setup as SegNet [badrinarayanan2017segnet]: image resolution for training and inference, 477 images for training and validation, and 233 images as the test set.
Cityscapes contains diverse street-level images from different cities across Germany and France. It contains classes, of which only are used for semantic segmentation evaluation. The dataset contains high-quality pixel-level finely annotated images and coarsely annotated images. The finely annotated images are divided into image sets for training, validation, and testing. We do not use the coarsely annotated data in our experiments.
We implement RGPNet in the PyTorch framework [paszke2017automatic]. For training on both datasets, we employ a polynomial learning rate policy in which the initial learning rate is multiplied by a polynomial decay factor after each iteration. The base learning rate is set to . Momentum and weight decay coefficients are set to and , respectively. We train our model with the synchronized batch-norm implementation provided by Zhang_2018_CVPR. The batch size is kept at , and training is performed on two Tesla V100 GPUs. For data augmentation, we apply random cropping and re-scaling with as the crop size. The image base size is for Mapillary and for Cityscapes. Re-scaling is done in the range of to , followed by random left-right flipping during training.
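The poly learning-rate policy can be sketched as below. The exponent 0.9 is an assumption (the common choice in segmentation work), since the paper elides the exact value:

```python
import torch

def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Standard polynomial decay: lr = base_lr * (1 - t/T)^power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# Illustrative wiring into a PyTorch optimizer via LambdaLR,
# which scales the optimizer's base lr by the lambda's output.
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda it: (1.0 - it / 1000) ** 0.9)
```

The schedule starts at the base learning rate and decays smoothly to zero at the final iteration.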
As a loss function, we use cross entropy with online hard example mining (OHEM) [wu2016high, yuan2018ocnet]. OHEM keeps only the sample pixels that are hard for the model to predict in a given iteration. The hard sample pixels are determined by a probability threshold for the corresponding target class; thus, only pixels whose target-class probability falls below the threshold are preserved in training. To have enough representatives of each class in the mini-batch, a minimal pixel ratio is applied. In our experiments, we set and .
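A sketch of OHEM as described above; the `thresh` and `min_kept` defaults are placeholders, since the paper's values are elided:

```python
import torch
import torch.nn.functional as F

def ohem_cross_entropy(logits, labels, thresh=0.7, min_kept=1000):
    """OHEM sketch: keep only 'hard' pixels whose predicted probability
    for the target class is below `thresh`, but always keep at least
    `min_kept` pixels so every mini-batch contributes a loss."""
    # Per-pixel cross-entropy, flattened to one loss value per pixel.
    pixel_loss = F.cross_entropy(logits, labels, reduction='none').view(-1)
    with torch.no_grad():
        probs = F.softmax(logits, dim=1)
        # Probability assigned to the target class at each pixel.
        target_probs = probs.gather(1, labels.unsqueeze(1)).squeeze(1).view(-1)
    hard = target_probs < thresh
    if int(hard.sum()) < min_kept:
        # Fall back to the min_kept hardest pixels (lowest target prob).
        keep = torch.zeros_like(hard)
        keep[target_probs.argsort()[:min_kept]] = True
    else:
        keep = hard
    return pixel_loss[keep].mean()
```

A production variant would also handle an ignore index for unlabeled pixels; that is omitted here for brevity.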
4.1 Results on Mapillary
In this section, we evaluate and compare the overall performance of RGPNet with other real-time semantic segmentation methods (BiSeNet [Yu_2018], TASCNet [li2018learning], and ShelfNet [zhuang2018shelfnet]) on the Mapillary validation set. We use different feature extractor backbones: ResNet [He_2016_Resnet] (R101, R50, and R18), Wide-ResNet [Wider_or_Deeper_Wu_2019] (WRN38), and HarDNet [chao2019hardnet] (HarDNet39D).
Table 1 compares the speed (FPS), mIoU, and number of parameters of these methods using 16-bit precision computation. RGPNet(R101) achieves mIoU, outperforming TASCNet and ShelfNet by a significant margin at lower latency. Although RGPNet(R101) has more parameters than TASCNet(R101), both its speed and mIoU are considerably higher. BiSeNet, however, demonstrates poor performance on Mapillary, resulting in the lowest mIoU. Using TensorRT, RGPNet (with R101 as the encoder) speeds up to FPS at full image resolution (Table 7). Our method also achieves impressive results with a lighter encoder (R18 or HarDNet39D), significantly surpassing BiSeNet with a heavy backbone (R101): vs mIoU and 54.4 vs 15.5 FPS. Finally, Figure 4 shows some qualitative results obtained by our model compared to TASCNet and BiSeNet.
4.2 Results on CamVid
In Table 2, we compare the overall performance of RGPNet with other real-time semantic segmentation methods (SegNet, FCN [long2015fully], FC-DenseNet [jegou2017one], and FC-HarDNet [chao2019hardnet]) on the CamVid test set. RGPNet with R18 and R101 backbones obtains and mIoU at and FPS, respectively. RGPNet achieves a significant increase in mIoU for the Car, Traffic Sign, Pole, and Cyclist categories. Overall, we observe that our model outperforms the state-of-the-art real-time segmentation models.
4.3 Results on Cityscapes
Table 3 shows the comparison between RGPNet and state-of-the-art real-time (BiSeNet, ICNet [zhao2018icnet], FastSCNN [poudel2019fastscnn], and ContextNet [poudel2018contextnet]) and offline (HRNet [Sun19hrnet] and DeepLabv3 [deeplabv3plus2018]) semantic segmentation methods on the Cityscapes validation dataset. RGPNet achieves mIoU, which is slightly lower than BiSeNet's mIoU. ICNet, ContextNet, and FastSCNN achieve lower mIoU. Compared to the heavy offline segmentation methods, RGPNet(R101) is not only the fastest but also outperforms DeepLabv3 and BiSeNet (R101), and is comparable to HRNet.
We therefore conclude that RGPNet is a real-time general purpose semantic segmentation model that performs competitively on a wide spectrum of datasets compared to state-of-the-art semantic segmentation networks designed for specific datasets.
Some reported speeds are computed in the TensorFlow framework with our in-house implementations.
4.4 Progressive resizing with label relaxation
In order to validate the gain from label relaxation, we compare the results of progressive resizing training with and without label relaxation. In these experiments, for the first 100 epochs, the input images are resized by a factor of in both width and height. At the epoch, the image resize factor is set to and, at the epoch, full-sized images are used. With label relaxation, we observe that the model achieves higher mIoU, especially at lower resolutions. To further analyze the effect of label relaxation in the progressive resizing technique, we illustrate the difference in entropy between the two setups (progressive resizing with and without label relaxation). Figure 5 shows that the model trained with label relaxation is more confident in its predictions around object boundaries.
To examine the energy efficiency, we run two experiments, with and without the progressive resizing training technique, on a single GPU for 15 epochs. In the standard training experiment, we use full-size Cityscapes images. In the progressive resizing experiment, we start with of the image size and then scale up by a factor of 2 at the and epochs. The speedup factor can theoretically be calculated as . Table 4 shows that the training time is reduced from 109 minutes to 32 minutes, close to the speedup expected from the theoretical calculation. The energy consumed by the GPU decreases by an approximate factor of 4 with little to no drop in performance. Towards green AI, given this remarkable gain in energy efficiency, we therefore suggest adopting the progressive resizing technique with label relaxation for training semantic segmentation networks.
| Training scheme | Energy consumed | Training time | mIoU |
| Progressive resizing | 212 | 31m 43s | 78.8 |
| Full scale | 873 | 108m 44s | 80.9 |
4.5 Ablation study
In this section, we perform an empirical evaluation of the structure of the adaptor module in our design. We remove the downsampling layers, which provide information from a higher resolution of the encoder to the adaptor. Table 5 shows that the performance of our model significantly drops from to on the Mapillary validation set. This indicates that the specific design of the adaptor plays an important role in feature preservation and refinement in our model.
| Model | mIoU |
| RGPNet w/o downsampling connections | 46.8 |
We also perform an ablation study of common techniques used in the literature on the Cityscapes dataset. For Cityscapes training, we adopt a model pretrained on the Mapillary dataset by sorting the last layer weights according to the mapping between Mapillary and Cityscapes categories. This results in a boost of more than in mIoU.
We use TensorRT optimization for RGPNet and evaluate on an NVIDIA RTX 2080Ti and a Xavier. RGPNet obtains and mIoU on the Cityscapes validation set with half- and full-precision floating point formats, respectively. The inference speed results for different backbones and two input resolutions, using 16-bit and 32-bit floating point numbers, are reported in Table 7. RGPNet(R101) using TensorRT at full input resolution achieves a significant increase in speed, from 37.8 FPS to 153.4 FPS, with 16-bit floating point operations. The speedup with FP16 compared to FP32 is noticeable for all backbones and both input resolutions. The results suggest that RGPNet can run at high speed on edge computing devices with little or negligible drop in accuracy. A real-world example is provided in Figure 6.
5 Conclusion
In this paper, we proposed a real-time general purpose semantic segmentation network, RGPNet. It incorporates an adaptor module that aggregates features from different abstraction levels and coordinates between the encoder and decoder, resulting in better gradient flow. Our conceptually simple yet effective model achieves efficient inference speed and accuracy on resource-constrained devices across a wide spectrum of complex domains. By employing a modified progressive resizing training scheme, we reduced the training time by more than half with a negligible drop in performance, thereby substantially decreasing the carbon footprint of training. Furthermore, our experiments demonstrate that RGPNet can generate segmentation results in real time with accuracy comparable to state-of-the-art non-real-time models. This balance of speed and accuracy makes our model suitable for real-time applications such as autonomous driving, where the environment is highly dynamic.