Semantic segmentation is a major challenge in computer vision that aims to assign a label to every pixel in an image.[6, 16] Fully convolutional networks are shown to be the state-of-art approach in semantic segmentation tasks over the recent years and offer simplicity and speed during learning and inference. Such networks have a broad range of applications such as autonomous driving, robotic navigation, localization, and scene understanding.
These architectures are trained by supervised learning over numerous images and detailed annotations for each pixel. Data sets that offer semantic segmentation annotations on a rich variety of objects and stuff categories have emerged such as COCO, ADE20K, Cityscapes, PASCAL VOC, thus opening new windows in the field.
Computationally efficient convolutional networks have been gaining momentum over the recent years but the segmentation task is still an open problem. Proposed networks for the semantic segmentation task are deep and resource hungry because of their purpose of achieving the highest accuracy such as [4, 6, 24, 20]
. These approaches have high complexity and may contain custom operations which are not suitable to be run on current implementations of neural network interpreters offered for mobile devices. Such devices lack the computation power of specialised GPUs, resulting in very poor inference speed. Mobile capable approaches in the feature extraction task such as Mobilenet V2, ShuffleNet V2 have motivated us to explore the performance of such architectures to be used in this context.
In this paper, we explore ShuffleNet V2, as the feature extractor with simplified DeepLabV3+ heads and recently proposed DPC architecture, and report our findings of both model on scene understanding using Cityscapes data set. Furthermore, we present the number of floating point operations(FLOPs) and on-device inference performance of each approach. Our contributions to the field can be listed in three points:
We achieve state-of-art computation efficiency in the semantic segmentation task while achieving mean intersection over union (mIOU) on Cityscapes test set using ShuffleNet V2 along with DPC encoder and a naive decoder module.
Our proposed model and implementation is fully compatible with TensorFlow Lite and runs real-time on Android and iOS-based mobile phones.
2 Related Work
In this section, we talk about the current state-of-art in the task of semantic segmentation, especially mobile capable approaches and performance metrics to measure the efficiency of networks.
CNNs have shown to be the state-of-art method for the task of semantic segmentation over the recent years. Especially fully convolutional neural networks (FCNNs) have demonstrated great performance on feature generation task and end-to-end training and hence is widely used in semantic segmentation as encoders. Moreover, memory friendly and computationally light designs such as [13, 21, 23, 18]
, have shown to perform well in speed-accuracy trade-off by taking advantage of approaches such as depthwise separable convolution, bottleneck design and batch normalization. These efficient designs are promising for usage on mobile CPUs and GPUs, hence motivated us to use such networks as encoders for the challenging task of semantic segmentation.
. But these approaches use deep feature generators and complex reconstruction methods for the task, thus making them unsuitable for mobile use, especially for the application of autonomous cars where resources are scarce and computation delays are undesired.
In this sense, one of the recent proposals in feature generation, ShuffleNet V2, demonstrates significant efficiency boost over the others while performing accurately. According to , there are four main guidelines to follow for achieving a highly efficient network design.
When the channel widths are not equal, there is an increase in the memory access cost (MAC) and thus, channel widths should be kept equal.
Excessive use of group convolutions should be avoided as they raise the MAC.
Fragmentation in the network should be avoided to keep the degree of parallelism high.
Element-wise operations such as ReLU, Add, AddBias are non-negligible and should be reduced.
To achieve such accuracy at low computation latency, they point out two main reasons. First, their guidelines for efficiency allow each building block to use more feature channels and have a bigger network capacity. Second, they achieve a kind of “feature reuse” by their approach of keeping half of the feature channels pass through the block to join the next block.
Another important issue is the metric of performance for convolutional neural networks. The efficiency of CNNs is commonly reported by the total number of floating point operations (FLOPs). It is pointed out in  that, despite their similar number of FLOPs, networks may have different inference speeds, emphasizing that this metric alone can be misleading and may lead to poor designs. They argue that discrepancy can be due to memory access cost (MAC), parallelism capability of the design and platform dependent optimizations on specific operations such as cuDNN’s Conv. Furthermore, they offer to use a direct metric (e.g., speed) instead of an indirect metric such as FLOPs.
Next, we shall examine the state-of-art on the semantic image segmentation task by focusing on the computationally efficient approaches.
Atrous convolutions, or dilated convolutions, are shown to be a powerful tool in the semantic segmentation task 
. By using atrous convolutions it is possible to use pretrained ImageNet networks such as[18, 21] to extract denser feature maps by replacing downscaling at the last layers with atrous rates, thus allowing us to control the dimensions of the features. Furthermore, they can be used to enlarge the field of view of the filters to embody multi-scale context. Examples of atrous convolutions at different rates are shown in Figure 1.
DeepLabV3+ DPC  achieves state-of-art accuracy when it is combined with their modified version of Xception backbone. In their work, Mobilenet V2 has shown to have a correlation of accuracy with Xception while having a shorter training time, and thus it is used in the random search of a dense prediction cell (DPC). Our work is inspired by the accuracy that they have achieved with the Mobilenet V2 backbone on Cityscapes set in , and their approach of combining atrous separable convolutions with spatial pyramid pooling in . To be more specific, we use lightweight prediction cell (denoted as basic) and DPC which were used on the Mobilenet V2 features, and the atrous separable convolutions on the bottom layers of feature extractor in order to keep higher resolution features.
Semantic segmentation as a real-time task has gained momentum on popularity recently. ENet  is an efficient and lightweight network offering low number of FLOPs in the design and ability to run real-time on NVIDIA TX1 by taking advantage of bottleneck module. Recently, ENet was further fine-tuned by 
, increasing the Cityscapes mean intersection over union from 58.29% to 63.06% by using a new loss function called Lovasz-Softmax. Furthermore, SHUFFLESEG, demonstrates different decoders that can be used for Shufflenet, prior work to ShuffleNet V2, comparing their efficiency mainly with ENet and SegNet by FLOPs and mIOU metrics but they do not mention any direct speed metric that is suggested by ShuffleNet V2. The most comprehensive work on the search for an efficient real-time network was done in  and they report that SkipNet-Shufflenet combination runs frames per second (fps) with an image resolution of on Jetson TX2. This work again, like SHUFFLESEG, is based on the prior design of the channel shuffle based approach.
Our literature review showed us that the ShuffleNet V2 architecture is yet to be used in semantic segmentation task as a feature generator. Both  and SHUFFLESEG point out the low FLOP achievable by using ShuffleNet and show comparable accuracy and fast inference speeds. In this work, we exploit improved ShuffleNet V2 as an encoder module modified by atrous convolutions and well-proven encoder heads of DeepLabV3 and DPC in conjunction. Then, we evaluate the network on Cityscapes, a challenging task in the field of scene parsing.
This section describes our network architecture, and the training procedures, and evaluation methods. The network architecture is based upon the state-of-art efficient encoder, ShuffleNet V2, and DeepLabV3 and DPC heads built on top to perform segmentation. Training procedure includes restoring from an Imagenet checkpoint, pre-training on MS COCO 2017 and Cityscapes coarse annotations, and fine-tuning on fine annotations of Cityscapes. We then evaluate the trained network on Cityscapes validation set according to their evaluation procedure in .
3.1 Network Architecture
In this section, we will give a detailed explanation of each step of the proposed network. Figure 2 shows the different stages of the network and how the data flows from the start to the end. We start with Shufflenet V2 feature extractor, then add the encoder head (DPC or basic DeepLabV3), and finally use resize bilinear as a naive decoder to produce the segmentation mask. The final downsampling factor of the feature extractor is denoted as .
For the task of feature extraction, we choose Shufflenet V2 because of its success for speed versus accuracy trade-off. Our selection for the of ShuffleNet V2 is as it can be seen in the output channels column of Table 1. Our decision was purely made by accuracy and speed results of this variation on the ImageNet data set and we have not run experiments on different variations of the
. Although, one might choose lower values for this hyperparameter to achieve faster inference time by compromising on accuracy and conversely higher values might result in a gain of accuracy in favour of inference speed.
Figure 3 shows the building blocks of the feature extractor architecture in detail. Each stage consists of one spatial downsampling unit and several basic units. In the original implementation of Shufflenet V2, goes as low as . In our approach, entry flow, and of the proposed feature extractor is implemented as in . In the case of , the last stage, namely , has been modified by setting instead of on the downsampling layer and the atrous rate of the preceding depth-wise convolutions are set to divided by as described in  to adjust the final downsampling factor of the feature extractor. However, in the case of
, stride and atrous rate modification starts from. We choose to focus on due to its faster computation speed.
After ShuffleNet V2 features are extracted we employ the DPC encoder. Figure 4 shows the design of the basic encoder head and the DPC head. The basic encoder head does not contain any atrous convolutions in its design for lower complexity. On the other hand, DPC is using five different depthwise convolutions at different rates to understand features better. After the encoder heads, as seen in Figure 5, features are reduced down to
, then to the number of classes. Afterwards, a drop out layer with 0.9 probability of keeping is applied.
For decoding, we are using the same naive approach as , where a simple bilinear resizing is applied to upscale back to the original image dimensions. In the case of , the upscaling factor is 16. In their later work, DeepLabV3+, they propose an approach where features from earlier stages (e.g. of ) are concatenated with the upsampled features of the last layer where the upscaling factor is divided by to preserve the finer details in the segmentation result. We have not included this part as it adds more complexity and slowing down the inference time.
|Layer||Output Size||Kernel||Stride||Rate||Repeat||Output Channels|
Proposed feature extractor weights are initially restored from an ImageNet checkpoint. After that, we pre-train the network end-to-end on MS COCO. Further pre-training was done on Cityscapes coarse annotations before we finalize our training on Cityscapes fine annotations. During the training at each step data augmentation was performed on the images by randomly scaling between and with steps of , randomly cropping by and randomly flipping left and right.
For preprocessing the input images, we standardize each pixel to range according to Equation 1.
3.2.2 Ms Coco 2017
We pre-train the network on MS COCO as suggested in [4, 6, 5]. Only the relevant classes to the Cityscapes task, which are person, car, truck, bus, train, motorcycle, bicycle, stop sign and parking meter, were chosen and other classes were marked as background. Also, we further filter the samples by criteria of having a non-background class area of over pixels. This yields us training and validation samples. Training was done end-to-end by using a batch size of , of , weight decay set to to prevent over-fitting, and a “poly” learning rate policy as shown in Equation 2 where is the learning rate at step , set to and set to  for 60K steps using Adam optimizer(, , ).
First, we pre-train the network further using 20K coarsely annotated samples for 60K steps, following the same parameters that we used in MS COCO training. Then, as the final step, we fine-tune the network for 120K steps on finely annotated set ( train, validation samples) with an initial learning rate of , a slow start at learning rate for the first 186 steps. We found out that lowering the learning rate at the fine-tuning stage is crucial to achieving good accuracy. Other parameters are kept the same as the previous steps.
4 Experimental Results on Cityscapes
Cityscapes is a well-studied data set in the field of scene parsing task. It contains roughly 20K coarsely annotated samples, and 5000 finely annotated samples which are split into 2975, 500 and 1525 for training, validation, and testing respectively.
Our approach of combining ShuffleNet V2 with DeepLabV3 basic and DPC head achieves both state-of-art mIOU and inference speed with the respective GFLOPs counts. Table 2 shows the comparison of GFLOPs to mIOU performance on the validation set. ShuffleNet V2 with basic DeepLabV3 encoder head, having GFLOPs, surpasses the SkipNet-ShuffleNet which has similar floating operations count by gain on mIOU. On the other hand, DPC variation performs more accurate than the Mobilenet V2 with basic heads, which has times more GFLOPs.
ShuffleNet V2 with DPC heads outperforms state-of-art efficient networks on the Cityscapes test set. Table 3 shows that we make a gain of 7.27% mIOU over the best performing ENet architecture. Furthermore, on each of the classes highlighted here, ShuffleNet V2DPC shows a great improvement. Also, as seen in the Table 4, our approach outperforms the previous methods on the category mIOU and instance level metrics.
We visualize the segmentation masks generated both by the basic and DPC approach in Figure 6. From the visuals, we can see that the DPC head variant is able to segment thinner objects such as poles better than the basic encoder head. The figure is best viewed in colour.
|Method||Class IOU||Class iIOU||Cat. IOU||Cat. iIOU|
|ShuffleNet V2+DPC (Ours)||70.33||43.58||86.48||69.92|
4.1 Inference Speed on Mobile Phone
In this section, we provide inference speed results and our procedure for the measurements. As discussed in ShuffleNet V2 , we believe that the computational efficiency of the network should be measured on the target devices rather than purely comparing by the FLOPs. Thus, we convert the competing networks to Tensorflow Lite, a binary model representation for inferencing on mobile phones, for the speed evaluation. Tensorflow Lite has a limited number of available operations in its current implementation, thus complex architectures such as  cannot be converted without implementing the missing operations on the Tensorflow Lite. But, both Mobilenet V2 and ShuffleNet V2 are compatible and require no additional custom operations.
The measurements were done on OnePlus A6003, Snapdragon 845 CPU, 6GB RAM, 16 + 20 MP Dual Camera, Android version 9, OxygenOS version 9.0.3. The device is stabilized in place, put to airplane mode, background services are disabled, the battery is fully charged and kept plugged to the power cable. The rear camera output image is scaled so that the smaller dimension is 224, then the middle of the image is cropped to get an input image of . The downscaling and cropping is not included in the inference time shown in Table 5 however the preprocessing of the input image, according to Equation 1, and the final ArgMax is included in the inference time. Inference speed measurements, presented in Table 5, are averaged over 300 frames after 30 seconds of an initial warm-up period.
shows that our approach of using ShuffleNet V2 as a backbone is capable of performing real-time, close to 20Hz, semantic segmentation on a mobile phone. Even with the complex DPC architecture, ShuffleNet V2 backbone ends up with 1.54 times fewer GFLOPs over the Mobilenet V2 with basic encoder head. Furthermore, model size is significantly lower which is desired for embedded devices that might have limited memory or storage size. One surprising finding is that the Mobilenet V2 variants show much higher variance in the inference time compared to the ShuffleNet V2 backbone, thus our approach provides a more stable fps.
|Backbone||Encoder||GFLOPs||Inference (ms)||Var (ms)||FPS||Size (MB)|
In this paper, we have described an efficient solution for semantic segmentation by achieving state-of-art inference speed without compromising on the accuracy. Our approach shows that ShuffleNet V2 is a powerful and efficient backbone for the task of semantic segmentation. It achieves 70.33% mIOU when combined with the DPC head, and 67.7% mIOU222The validation set result. combined with the basic encoder head on Cityscapes challenge. Furthermore, we showed that our network is capable of running real-time, the DPC head at 15.41 fps and the basic encoder head at 19.65 fps, on a mobile phone with an input image size of . Future work is to implement an efficient decoder architecture to get more refined edges along the borders of the objects on the segmentation mask.
-  Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(12), 2481–2495 (Dec 2017)
Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. Journal of Machine Learning Research13, 281–305 (2012)
Berman, M., Rannen Triki, A., Blaschko, M.B.: The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4413–4421 (2018)
-  Chen, L.C., Collins, M., Zhu, Y., Papandreou, G., Zoph, B., Schroff, F., Adam, H., Shlens, J.: Searching for efficient multi-scale architectures for dense image prediction. In: Advances in Neural Information Processing Systems. pp. 8713–8724 (2018)
-  Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)
-  Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. Lecture Notes in Computer Science p. 833–851 (2018)
Chollet, F.: Xception: Deep learning with depthwise separable convolutions. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jul 2017)
-  Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jun 2016)
-  Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: CVPR09 (2009)
-  Everingham, M., Eslami, S.M.A., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision 111(1), 98–136 (Jan 2015)
-  Gamal, M., Siam, M., Abdel-Razek, M.: Shuffleseg: Real-time semantic segmentation network. arXiv preprint arXiv:1803.03816 (2018)
-  He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask r-cnn. 2017 IEEE International Conference on Computer Vision (ICCV) (Oct 2017)
-  Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
-  Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37. pp. 448–456. ICML’15, JMLR.org (2015)
-  Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
-  Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. Lecture Notes in Computer Science p. 740–755 (2014)
-  Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3431–3440 (2015)
-  Ma, N., Zhang, X., Zheng, H.T., Sun, J.: Shufflenet v2: Practical guidelines for efficient cnn architecture design. Lecture Notes in Computer Science p. 122–138 (2018)
-  Paszke, A., Chaurasia, A., Kim, S., Culurciello, E.: Enet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147 (2016)
-  Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 p. 234–241 (2015)
-  Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: Mobilenetv2: Inverted residuals and linear bottlenecks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (Jun 2018)
-  Siam, M., Gamal, M., Abdel-Razek, M., Yogamani, S., Jagersand, M., Zhang, H.: A comparative study of real-time semantic segmentation for autonomous driving. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 700–70010. IEEE (2018)
-  Zhang, X., Zhou, X., Lin, M., Sun, J.: Shufflenet: An extremely efficient convolutional neural network for mobile devices. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (Jun 2018)
-  Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jul 2017)
-  Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)