1 Background and Motivation
Semantic segmentation is the task of understanding an image at the pixel level: every pixel is assigned a label from a predefined set of classes. It comes in two flavors: one that does not differentiate between object instances of the same class, referred to as pixel-level semantic segmentation, and one that does, instance-level semantic segmentation. An example of a semantically labeled image, with objects of different classes such as roads, people, and trees shown in different colors, is shown below.
A popular application of semantic segmentation is in autonomous driving systems, where reliable and accurate scene understanding is a critical component. In addition, there is a strong requirement to segment the image in real time, as the self-driving car needs to react instantly to new events to guarantee the safety of everyone involved.
Hence, there is a need to quantify and understand the network's end-to-end response time, i.e., its inference time, the bottlenecks that dictate it, and the methods and techniques that can be employed to improve it.
In this work, a Fully Convolutional Network architecture for the task of pixel-wise semantic segmentation on the Cityscapes dataset is implemented and performance metrics are obtained. Numerical investigations are carried out for several inference optimization techniques, such as weight quantization, with the goal of improving inference time. Finally, the trained model is ported to an embedded platform (Nvidia Jetson TX1) and inference times are quantified when the built-in optimizations in Nvidia's TensorRT inference engine are enabled.
2 Related Work
In this section, literature related to semantic segmentation, its application in the field of self-driving, and the enabling datasets and their corresponding challenges are outlined in three subcategories.
2.1 Deep semantic segmentation
Semantic segmentation, which was viewed as a challenging problem in computer vision until a few years ago, has witnessed rapid progress recently with deep learning . One of the seminal works in this area, which brought focus to end-to-end learning of pixel-wise classification, is the Fully Convolutional Network (FCN) architecture. It does not have the fully-connected layers at the end that are typically used for classification, but instead employs convolutional layers to classify each pixel in the image. The key insight of this work is that the network first learns feature maps, whose height and width are reduced by striding and pooling operations; these are then upsampled within the network using transpose convolution (or deconvolution) so that the output dimensions match those of the original input image, yielding dense predictions.
One of the principal limitations of this approach, however, is the loss of resolution in the final prediction, since the architecture relies on first downsampling the image into feature maps. This is addressed in , where a deeper transpose-convolution network, with stacked deconvolution and unpooling layers, was employed to obtain a performance gain, the argument being that the deconvolution network in  is overly simple and its input too coarse. In SegNet, a similar encoder-decoder approach is used to address the loss of detailed object structure caused by coarse feature maps; the decoder network, however, uses the max-pooling indices from the corresponding encoder layer to perform upsampling.
The issue of multi-scale semantics is the focus of , . Networks with a fixed-size receptive field can only handle semantics at a single scale: if an object is substantially larger or smaller than the receptive field, it is either fragmented or mislabeled. Building upon the skip-architecture idea proposed in  to merge feature maps from different resolutions, U-Net, a U-shaped encoder-decoder network, was developed, in which feature maps from earlier layers are upsampled and added to the inputs of later layers. Another work  introduced dilated convolutions to aggressively increase the receptive field of the kernel without introducing additional parameters or subsampling, which provided a better solution for handling multiple scales.
2.2 Semantic segmentation in self-driving
We now shift to the application of semantic segmentation to scene understanding in self-driving systems, which puts forth the need to reduce inference latency and hence the computation required. One approach is to design computationally efficient architectures such as SqueezeNet, which demonstrated that the image classification accuracy of AlexNet can be reproduced using 50x fewer parameters. ENet also presented a more efficient architecture based on convolutional layer factorization: each n x n convolution is decomposed into two smaller ones following each other, one with an n x 1 filter and the other with a 1 x n filter. This allows for large speedups and greatly reduces the number of parameters, making them less redundant.
Another line of research focuses on increasing the efficiency of existing networks by deriving smaller networks from larger counterparts , or by pruning or quantizing weights . Another trend in the industry is to tweak networks for execution on specific hardware designs, or to implement them using platform-specific libraries such as TensorRT, which optimizes deep learning models for inference and creates a runtime for deployment on specific hardware platforms.
2.3 Datasets for street scene understanding
A major contributing factor to the progress of deep learning, especially on the problem of image classification, is the availability of large-scale, publicly available datasets such as ImageNet. Similarly, research progress in applying semantic segmentation to street scene understanding for self-driving can be related to the existence of datasets such as the KITTI Vision benchmark suite  and CamVid . However, these datasets are relatively small and do not fully capture the variability and complexity of real-world scenarios. Cityscapes is a high-quality dataset for semantic street scene understanding, with labeled examples of actual road scene images from 50 German cities collected in different weather conditions, and is therefore tailored for autonomous driving in an urban environment . A more recent effort to build a much larger dataset resulted in the Mapillary Vistas dataset , with 25,000 images and 66 classes.
Despite the significant progress, the best inference times for the semantic segmentation task on embedded systems are still below  frames per second , , which is clearly not acceptable for a viable commercial solution. Hence, there is a need to build upon these ideas and explore methods to reduce inference time.
The FCN architecture, which still serves as a blueprint for most segmentation architectures, is employed in this study . The network is composed of two parts: an encoder and a decoder. The encoder corresponds to the feature extractor that transforms the input image into a multidimensional feature representation, whereas the decoder derives the semantic segmentation map from the features extracted by the encoder.
The encoder is a modified VGG16 architecture , which was initially designed for the task of image classification, has been shown to generalize well to other datasets, and is a popular encoder choice for segmentation tasks as well. The main contribution of this architecture is its use of very small (3x3) convolution filters. It demonstrated that replacing large kernel-sized filters with multiple stacked 3x3 filters covering the same receptive field (the effective area of the input image on which an output depends) enables the network to learn more complex features at a lower computational cost.
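The parameter saving from stacking small filters can be checked with a few lines; a minimal sketch, where the channel count C = 256 is chosen arbitrarily and biases are ignored:

```python
def conv_params(k, c_in, c_out):
    """Number of weights in a k x k convolution (no bias)."""
    return k * k * c_in * c_out

def stacked_receptive_field(k, n):
    """Receptive field of n stacked k x k convolutions, stride 1."""
    return n * (k - 1) + 1

C = 256
# Three stacked 3x3 convolutions cover the same 7x7 receptive field...
assert stacked_receptive_field(3, 3) == 7
stacked = 3 * conv_params(3, C, C)   # 27 * C^2 = 1,769,472
single = conv_params(7, C, C)        # 49 * C^2 = 3,211,264
print(stacked, single)               # the stack uses ~45% fewer weights
```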
This network is modified by replacing the three fully connected layers at the end with three 1x1 convolution layers with 4096, 4096, and 35 (the number of classes in the dataset) filters, respectively.
|Layer||Activations||Parameters|
|Conv : 3x3-64|
|Conv : 3x3-64|
|Conv : 3x3-128|
|Conv : 3x3-128|
|Conv : 3x3-256|
|Conv : 3x3-256|
|Conv : 3x3-256|
|Conv : 3x3-512|
|Conv : 3x3-512|
|Conv : 3x3-512|
|Conv : 3x3-512|
|Conv : 3x3-512|
|Conv : 3x3-512|
|Conv : 1x1-4096|
|Conv : 1x1-4096|
|Conv : 1x1-35|
|Total Memory||154 MB||128 MB|

Table 1: Encoder memory estimates: activations and parameters
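The 1x1 convolutions at the end of the encoder act as fully connected layers applied independently at every pixel; a quick NumPy check of this equivalence (shapes chosen arbitrarily for illustration):

```python
import numpy as np

# A 1x1 convolution applies the same dense layer at every spatial location.
rng = np.random.default_rng(0)
H, W, C_in, C_out = 4, 4, 8, 3
x = rng.standard_normal((H, W, C_in))
weights = rng.standard_normal((C_in, C_out))  # 1x1 kernel, spatial dims squeezed
bias = rng.standard_normal(C_out)

# 1x1 convolution: a matrix multiply over the channel axis at each pixel
conv_1x1 = x @ weights + bias

# Per-pixel fully connected layer: flatten pixels, apply dense layer, reshape
dense = (x.reshape(-1, C_in) @ weights + bias).reshape(H, W, C_out)

assert np.allclose(conv_1x1, dense)
```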
|Layer||Activations||Parameters|
|up4: conv-transpose||16 x 32 x 512||(4 x 4 x 35 + 1) x 512|
|skip4: add||(16 x 32 x 512) x 2||0|
|up3: conv-transpose||32 x 64 x 256||(4 x 4 x 512 + 1) x 256|
|skip3: add||(32 x 64 x 256) x 2||0|
|output: conv-transpose||256 x 512 x 35||(16 x 16 x 256 + 1) x 35|
|Total Memory||17.5 MB||8.75 MB|

Table 2: Decoder (FCN-8s) memory estimates: activations and parameters
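The entries above follow the usual counting rules for (transpose-)convolution layers; a small sketch that reproduces some of them, with kernel sizes and channel counts taken from the table (the +1 in the parameter formula is the bias term):

```python
# Parameter and activation counts for the decoder layers in Table 2.
def tconv_params(k, c_in, c_out):
    """Weights plus biases of a k x k transpose convolution."""
    return (k * k * c_in + 1) * c_out

def activations(h, w, c):
    """Number of values in an h x w x c feature map."""
    return h * w * c

print(tconv_params(4, 35, 512))    # up4 parameters: 287,232
print(tconv_params(4, 512, 256))   # up3 parameters: 2,097,408
print(activations(256, 512, 35))   # output-layer activations: 4,587,520
```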
The task of semantic segmentation can be interpreted as understanding “what” class a pixel belongs to as well as “where” the pixel is in the original image. The challenge, however, is that the semantic information resides in the deeper layers, which are coarser in resolution, while the location information resides in the shallower layers, which are finer in resolution. Therefore, to improve dense prediction, coarse, semantic information from deeper layers is combined with finer, appearance information, and the way in which they are fused together results in different decoder architectures. In particular, three different networks are used in this study - FCN-32s, FCN-16s, and FCN-8s - which differ only in the decoder portion of the network, as shown in Figure 3.
FCN-32s: Decoder with just one upsampling step of stride 32 for the final layer to recover the predictions for every pixel in the original image.
FCN-16s: Decoder which combines predictions from both the final layer and the pool4 layer, at stride 16, that results in predictions with finer details, while retaining high-level semantic information.
FCN-8s: Decoder which further combines additional predictions from pool3, at stride 8. This provides further precision compared to both the FCN-32s and FCN-16s decoder networks.
Memory estimates for the activations and parameters of all layers in the encoder and decoder (FCN-8s) are shown in Table 1 and Table 2 respectively. It can be observed that the shallower convolutional layers take up most of the memory for activations, while the deeper layers take up most of the memory for parameters. It is also evident that the resources consumed by the decoder network are meager relative to those of the encoder network. Thus, employing a deeper or more complex decoder network might be a good option for improving the accuracy of dense predictions, as demonstrated in . These estimates are also used to identify the largest batch size for which the model fits on the computing system during training, which is discussed in the implementation section.
The dataset used in this study is the Cityscapes dataset. It has 5000 densely annotated images, split into training, validation, and test sets of 2975, 500, and 1525 images respectively. The annotations cover a total of 35 classes belonging to 8 groups, as shown in Table 3. Only the finely annotated images for pixel-level segmentation, without any additional training data or augmentations, are used in this evaluation.
|Group||Classes|
|flat||road, sidewalk, parking, rail track|
|human||person, rider|
|vehicle||car, truck, bus, on rails, motorcycle, bicycle, caravan, trailer|
|construction||building, wall, fence, guard rail, bridge, tunnel|
|object||pole, pole group, traffic sign, traffic light|
|nature||vegetation, terrain|
|sky||sky|
|void||ground, dynamic, static|
Instances will be labeled as a group if the boundary between them cannot be clearly seen.
This label is not included in any evaluation and is treated as void.
The images in the original dataset have a resolution of 2048 x 1024. Using full-size images is beyond the compute capability of the system that was used, as demonstrated by the memory requirements in Table 1. Therefore, the original images are scaled down to 256 x 512 for training/validation/testing and scaled back up using INTER_NEAREST interpolation in OpenCV . Code from  was used to pre-process the data to use all labeled classes, to generate ground truth images for the updated labels, and for validation and metric computation.
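Nearest-neighbor interpolation matters here because label images contain class ids, not intensities: averaging interpolators such as bilinear would blend ids into values that correspond to no class. A minimal NumPy sketch mimicking the behavior of OpenCV's INTER_NEAREST (the 4x4 label map is illustrative):

```python
import numpy as np

def resize_nearest(labels, new_h, new_w):
    """Nearest-neighbor resize of a 2-D label map."""
    h, w = labels.shape
    rows = np.arange(new_h) * h // new_h   # source row for each output row
    cols = np.arange(new_w) * w // new_w   # source column for each output column
    return labels[rows[:, None], cols[None, :]]

labels = np.array([[0, 0, 7, 7],
                   [0, 0, 7, 7],
                   [21, 21, 26, 26],
                   [21, 21, 26, 26]])
small = resize_nearest(labels, 2, 2)
# Only original class ids survive the resize; no blended values appear.
assert set(np.unique(small)) <= set(np.unique(labels))
print(small)
```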
4.2 Hardware and Software
|Device||Processor||RAM||NVIDIA GPU||VRAM||Compute Capability|
|Desktop||i7-3770K 8 cores||32 GB||2 x GTX 1080||16 GB||6.1|
|Laptop||i7-7700HQ 4 cores||32 GB||GTX 1050||4 GB||6.1|
|Jetson TX1||Quad ARM A57 4 cores||4 GB shared||256-core Maxwell||4 GB shared||5.3|
The hardware setup across the devices is as described in Table 4. The desktop and laptop setups have x86_64 Intel architectures, while the Jetson TX1 has an ARM64 processor. The software setup is kept as constant as possible across devices to achieve an accurate comparison of inference time and power. All devices run an Ubuntu 16.04 Linux distribution, except that the OS for the Jetson TX1 is optimized and packaged as L4T within NVIDIA's JetPack 3.2. TensorFlow 1.8.0 is installed on all devices; it is available as a prepackaged wheel for x86_64 architectures, but had to be compiled from source to run on the ARM64 processor of the Jetson TX1. CUDA 9.0 libraries and cuDNN are installed on all devices. The desktop and laptop builds have complete installations of TensorRT, while JetPack 3.2 currently supports TensorRT 3.0 RC without Python API support. All other dependencies are met using either prepackaged installers or source builds for the ARM architecture, which proved to be a cumbersome task on the weaker ARM processor.
4.3.1 Learning Rate
A parameter search over different learning rates was performed for about 10 epochs on the FCN-8s network; an initial learning rate of 0.0001, reduced to 0.00001 after about 25 epochs, worked best. No such study was performed for the FCN-16s and FCN-32s networks; instead, their learning rates were increased by factors of 10 and 100 respectively, which seemed to work well.
4.3.2 Batch Size
4.3.3 Cross Entropy Loss (CE Loss)
Cross-entropy loss is given by the following equation :

$$ \mathrm{CE} = -\frac{1}{N} \sum_{i=1}^{N} \log p_i(y_i) $$

where $N$ is the number of pixels in the image or batch considered, $y_i$ the ground-truth class of pixel $i$, and $p_i(y_i)$ the network's probability estimate for the ground-truth class of pixel $i$, obtained by mapping the unnormalized scores of the network through a softmax function.
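A minimal NumPy reference implementation of this loss (the study's actual implementation is in TensorFlow; logits and labels below are illustrative):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean pixel-wise cross-entropy.
    logits: (N, C) unnormalized scores; labels: (N,) ground-truth class ids."""
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels]))

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 3.0, 0.3]])
labels = np.array([0, 1])
loss = cross_entropy(logits, labels)
print(loss)  # small, since both pixels are classified correctly
```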
The three FCN networks (FCN-8s, FCN-16s, and FCN-32s) are implemented using TensorFlow. The encoder network is initialized with pre-trained VGG16 weights provided by Udacity, and each network is trained end-to-end, including the encoder, for about  epochs using Adam optimization to minimize the cross-entropy loss.
As the performance measure, the commonly used intersection-over-union (IOU) metric is used, evaluated for individual classes and categories. It is the standard Jaccard index, commonly known as the PASCAL VOC intersection-over-union metric: IOU = TP / (TP + FP + FN), where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively, determined over the whole test set.
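The per-class metric can be sketched as follows (pure NumPy, on a toy flattened prediction; the study itself uses the official Cityscapes evaluation script):

```python
import numpy as np

def iou_per_class(pred, gt, num_classes):
    """Jaccard index TP / (TP + FP + FN) for each class over all pixels."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float('nan'))
    return ious

pred = np.array([0, 0, 1, 1, 2, 2])   # predicted class per pixel
gt = np.array([0, 1, 1, 1, 2, 0])     # ground-truth class per pixel
print(iou_per_class(pred, gt, 3))     # [1/3, 2/3, 1/2]
```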
|CE Loss||mean IOU|
5.2 Segmentation Performance
The segmentation performance is evaluated on the validation set using the official Cityscapes evaluation script. The networks achieve per-class mean IOUs of , , and , and per-category mean IOUs of , , and , for FCN-8s, FCN-16s, and FCN-32s respectively. Detailed per-class classification results are presented in Table 6 and per-category results in Table 7.
The qualitative results of the network are presented in Table 9. It can be noticed that the network tends to fail in labeling very large or very small objects, due to its fixed-size receptive field . This trend can also be seen in the poor per-class IOU performance on, for example, the pole, traffic light, and motorcycle classes.
The learning curves for all networks, as well as the mean IOU on the training set as a function of training epoch, are shown in Table 5. Due to the time constraints of the project, training had to be stopped after about  epochs for each network, before the networks had fully attained their capacity. This explains the slightly better performance of FCN-16s compared to FCN-8s, and the lack of finer object structures in the FCN-32s outputs relative to the other two networks. For example, in Image 4 in Table 9, traffic signs and traffic lights are completely missing in FCN-32s. Also, compared with the benchmark data for the Cityscapes test set shown in Table 8 for two networks designed with the goal of reducing inference time, the segmentation results obtained in this study are about % lower, which is likewise explained by the prematurely stopped training.
|class IOU||Category IOU|
5.3 Effect of built-in TensorFlow optimizations on model size and inference time
|Optimization Parameter||Model Size in MB|
|frozen model no optimizations||153.715||154.497||140.013|
A promising approach to reducing inference time and DRAM footprint (and hence power consumption) is model compression. A compressed model that fits into on-chip SRAM cache rather than off-chip DRAM facilitates the deployment of deep networks in self-driving cars, where memory size, inference speed, and network bandwidth are all strictly constrained. This would enable a fully trained network to be loaded into the SRAM of an embedded processor inside a driverless car, providing on-chip, in-memory inference at low power . Therefore, the effect of the optimization techniques in TensorFlow's Graph Transform tool  on model size as well as inference time is quantified in this study.
The first step in deploying a trained network is to freeze it, i.e., to fuse the information stored in the graph definition and checkpoint files by fixing the weights of the network and removing training-only information such as optimizer options, gradients, etc. During training, weights are not stored in graph definitions, as they are constantly tuned, but in separate checkpoint files; freezing removes the overhead incurred in fetching the latest variable values from separate files.
Once the network is frozen, the Graph Transform tool provided with TensorFlow can be used to perform optimizations on the saved GraphDef (.pb) files. The tool supports a variety of transforms that can be applied to a network to optimize its size. The optimizations suggested in TensorFlow's documentation for deployment include stripping unnecessary and unused nodes, folding constants and batch norms, and quantizing weights. Table 10 shows the various optimizations performed on the graph model and their effect on model size. The model size remains essentially the same, within a few bytes, for the majority of optimizations; the exception is weight quantization, which converts large floating-point constant ops into 8-bit equivalents and reduces the model size to roughly a quarter of the original.
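For reference, a typical invocation of the Graph Transform tool might look as follows; the graph file names and the input/output node names are placeholders, while the transform names are those documented for TensorFlow 1.x:

```shell
# Apply the deployment transforms suggested in the TensorFlow docs to a
# frozen GraphDef. File names and node names below are illustrative.
bazel run tensorflow/tools/graph_transforms:transform_graph -- \
  --in_graph=frozen_fcn8s.pb \
  --out_graph=optimized_fcn8s.pb \
  --inputs='image_input' \
  --outputs='predictions' \
  --transforms='
    strip_unused_nodes
    fold_constants(ignore_errors=true)
    fold_batch_norms
    quantize_weights'
```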
Table 11 gives the impact of the optimizations on the inference time of the graph. As expected, the optimizations that did not impact the model size also did not cause any significant changes in inference time, and are hence grouped together with the baseline inference time. Weight quantization, on the other hand, has a drastic effect on inference time. Interestingly, it did not reduce inference times but rather increased them by a factor of . This might be due either to the additional operations needed to work with quantized weights, or to the lack of system-level drivers that leverage the memory optimizations mentioned in .
The effect of the underlying hardware platform on inference time is also quantified in Table 11. As expected, inference time varies inversely with the compute capability of the hardware, which is estimated to be of the order of  for the desktop, laptop, and Jetson TX1 respectively, based on pure compute capability. (The NVIDIA Jetson TX1 has 256 CUDA cores and shared RAM; the laptop has a GTX 1050 GPU with 4 GB GDDR5 VRAM and 768 CUDA cores, three times more than the TX1; and the desktop has two NVIDIA GTX 1080s with a total of 16 GB VRAM and 5120 CUDA cores (2560 x 2), about 20 times more than the TX1 and 7 times more than the laptop.) In reality, the ratio came to around , which is reasonably close, at least in order of magnitude, to the theoretical estimate. When the baseline model is run on the TX1, an out-of-memory (OOM) error occurs and the process is killed, so baseline results are not available. The weight-quantized model, however, runs on the TX1 without problems, which underlines the need for such optimizations on embedded platforms.
5.4 Effect of built-in TensorRT optimizations on inference time
TensorRT is NVIDIA's inference engine, employing optimizations and calibrations on neural networks to obtain optimal performance on GPUs designed by NVIDIA. The effect of TensorRT optimizations on the inference times of the network is quantified in this study, as the testing platforms are based on NVIDIA GPUs.
TensorFlow graphs are exported for use with other backends using the Universal Framework Format (UFF); when a UFF graph is parsed into a TensorRT engine, the following four automatic optimizations are performed. Layer and tensor fusion reduces the number of layers by recognizing and fusing layers that share the same input data and filter size, and CUDA kernels performing sequential operations are fused together to overcome the latency introduced by multiple kernel launches. Precision calibration allows choosing the inference precision among FP32, FP16, or INT8 without retraining the network. Kernel auto-tuning chooses an optimized kernel from a wide range of options to best suit the target GPU, input data, batch size, tensor layout, and other such parameters. Dynamic tensor memory ensures that memory is reused by designating memory for a tensor only while it is being used, which prevents memory allocation overhead. Together, these optimizations should bring a significant reduction in the inference time of the networks.
The Bonnet framework  provides a C++ API for TensorRT, which is used to run the TensorRT optimizations on the three platforms; their effect on inference time is documented in Table 12. As expected, the TensorRT optimizations showed a significant reduction in inference times, with a reduction of % on the desktop and 50% on the laptop.
|Device||Baseline (ms)||TensorRT optimized (ms)|
5.5 Comparison of performance and power metrics across hardware platforms
Given that the three hardware platforms have different compute capabilities as well as power consumption, inference times should be normalized by the power consumed to allow a fair comparison across platforms.
The power consumption data for the desktop and laptop devices are measured using the NVIDIA System Management Interface available as a command line utility with relevant parameters as shown below.
nvidia-smi daemon -i 0 -s p -d 5 -p /data/logs
The power reading provided by the tool is measured in Watts(W) for each GPU and is accurate to +/- 5 watts.
Measuring power on the Jetson TX1 is not as straightforward, as the custom graphics driver is not bundled with SMI. The TX1 has INA monitors to measure the current and voltage being drawn, which are available to the processor through an I2C interface. The TX1 has a three-channel monitor that provides the input current (mA), voltage (mV), and power (mW) at I2C address 0x40. The commands required to obtain these readings are:
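As a hedged illustration only: the sysfs path below is an assumption and differs across L4T releases, so the correct node should be verified on the actual board (e.g. by listing /sys/bus/i2c/devices), but a reading of the three-channel monitor at address 0x40 typically looks like:

```shell
# Read input power (mW) from the INA monitor at I2C address 0x40.
# NOTE: illustrative path; check /sys/bus/i2c/devices on the target board.
cat /sys/bus/i2c/devices/1-0040/iio_device/in_power0_input
```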
The outputs of the cat command can be redirected to a file for processing. Table 13 shows the average power consumed by each hardware platform. The energy consumption E (W-hr) is the energy consumed by the platform to run the  test images, which is given in terms of the inference time (ms) and the average power consumed (P).
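The energy figure follows directly from E = P x t; a minimal sketch with placeholder numbers (not the measured values from Table 13):

```python
# Energy per test run: E (W-hr) = average power (W) x total inference time (hr).
def energy_wh(avg_power_w, ms_per_image, num_images):
    """Energy in watt-hours consumed over num_images inferences."""
    hours = ms_per_image * num_images / 1000.0 / 3600.0
    return avg_power_w * hours

# e.g. a 10 W device taking 500 ms per image over 500 images:
e = energy_wh(avg_power_w=10.0, ms_per_image=500.0, num_images=500)
images_per_wh = 500 / e
print(e, images_per_wh)  # ~0.694 W-hr, 720 images per W-hr
```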
The relative performance of the network on the three platforms, measured purely in terms of inference time (i.e., how many images can be inferred per second), is  (inversely proportional to the inference times). However, when performance is compared in terms of energy consumed (i.e., how many images can be inferred per W-hr), the ratio changes to , highlighting the power efficiency of the embedded system.
6 Conclusion and Future Work
In this project, the task of pixel-wise semantic segmentation in the context of self-driving, with the goal of reducing inference time, is explored. FCN-based networks with a VGG16 encoder architecture and skip connections are trained on the Cityscapes dataset. On the validation set, the trained networks scored per-class mean IOUs of , , and  and per-category mean IOUs of , , and  for the FCN-8s, FCN-16s, and FCN-32s networks respectively. Several network optimizations built into TensorFlow and TensorRT are applied, and their impact on inference time as well as model size is quantified. Finally, the trained network is ported onto the Jetson TX1 and inference times across the hardware platforms are compared and presented.
This work could be extended in several ways. Although inference times across hardware platforms are compared in this study, the corresponding IOU scores for the validation/test sets are not obtained, which is needed to fully understand the accuracy-versus-inference-time tradeoff. Networks based on more efficient architectures such as SqueezeNet , coupled with optimizations, could also be investigated to quantify their performance on embedded platforms. Also, optimization techniques that require retraining, such as pruning, were not considered in this experiment and could be explored as well.
- TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. Cited by: §5.3.
- (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (12), pp. 2481–2495. Cited by: §2.1.
- The Lovász-Softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural networks. Conference on Computer Vision and Pattern Recognition. Cited by: §4.3.3.
- (2000) The OpenCV library. Dr. Dobb's Journal of Software Tools. Cited by: §4.1.
- (2009) Semantic object classes in video: a high-definition ground truth database. Pattern Recognition Letters 30 (2), pp. 88–97. Cited by: §2.3.
- (2016) The Cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.3, §4.1.
- (2009) ImageNet: a large-scale hierarchical image database. In CVPR09. Cited by: §2.3.
- (2015) The PASCAL visual object classes challenge: a retrospective. IJCV 111 (1), pp. 98–136. Cited by: §5.1.
- (2013) Vision meets robotics: the KITTI dataset. Int. J. Rob. Res. 32 (11), pp. 1231–1237. Cited by: §2.3.
- (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. CoRR abs/1510.00149. Cited by: §2.2, §5.3.
- (2015) Distilling the knowledge in a neural network. ArXiv e-prints. Cited by: §2.2.
- (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR abs/1602.07360. Cited by: §2.2, §6.
- ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1097–1105. Cited by: §2.2.
- (2015) Deep learning. Nature. Cited by: §1.
- (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §2.1, §3.2, §3.
- Bonnet: an open-source training and deployment framework for semantic segmentation in robotics using CNNs. In Proc. of the IEEE Intl. Conf. on Robotics & Automation (ICRA). Cited by: §5.4.
- (2017) The Mapillary Vistas dataset for semantic understanding of street scenes. In International Conference on Computer Vision (ICCV). Cited by: §2.3.
- (2015) Learning deconvolution network for semantic segmentation. CoRR abs/1505.04366. Cited by: §2.1, §3.2, §5.2.
- (2016) ENet: a deep neural network architecture for real-time semantic segmentation. CoRR abs/1606.02147. Cited by: §2.2, §2.3, Table 8.
- (2015) U-Net: convolutional networks for biomedical image segmentation. CoRR abs/1505.04597. Cited by: §2.1.
- (2017) Deep semantic segmentation for automated driving: taxonomy, roadmap and challenges. Intelligent Transportation Systems Conference. Cited by: §2.1.
- (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556. Cited by: §3.1.
- (2016) Speeding up semantic segmentation for autonomous driving. Cited by: §1, §2.3, Table 8.
- (2015) Multi-scale context aggregation by dilated convolutions. CoRR abs/1511.07122. Cited by: §2.1.