Investigations on the inference optimization techniques and their impact on multiple hardware platforms for Semantic Segmentation

11/29/2019 ∙ by Sethu Hareesh Kolluru, et al. ∙ Stanford University

In this work, the task of pixel-wise semantic segmentation in the context of self-driving, with the goal of reducing inference time, is explored. Fully Convolutional Networks (FCN-8s, FCN-16s, and FCN-32s) with a VGG16 encoder architecture and skip connections are trained and validated on the Cityscapes dataset. Numerical investigations are carried out for several inference optimization techniques built into TensorFlow and TensorRT to quantify their impact on inference time and network size. Finally, the trained network is ported onto an embedded platform (Nvidia Jetson TX1), and the inference time, as well as the total energy consumed for inference, is compared across hardware platforms.




1 Background and Motivation

Semantic segmentation is the task of understanding an image at the pixel level and assigning a label from a group of classes to every pixel. It comes in two flavors: one that does not differentiate between object instances of the same class, referred to as pixel-level semantic segmentation, and one that does, instance-level semantic segmentation. An example image that has been semantically labeled, with objects of different classes such as roads, people, and trees shown in different colors, is shown below.

Figure 1: An annotated example image from Cityscapes dataset

A popular application of semantic segmentation is in autonomous driving systems, where reliable and accurate scene understanding is a critical component. In addition, there is a strong requirement to segment the image in real time, as the self-driving car needs to react instantly to new events to guarantee the safety of its passengers.

The advent of deep learning has made great strides towards better visual understanding [14] and, in particular, semantic segmentation. However, this performance was accomplished by increasing the depth of the networks as well as the computational infrastructure required. This poses an even greater challenge in the context of self-driving, as deploying such deep networks as inference engines is not feasible, or at least difficult, on an embedded device with limited compute capability in a self-driving car.

Hence, there is a need to quantify and understand the network's end-to-end response time, i.e., inference time, the bottlenecks that dictate it, as well as methods and techniques that can be employed to improve it.

In this work, a Fully Convolutional Network architecture for the task of pixel-wise semantic segmentation on Cityscapes dataset is implemented and performance metrics are obtained. Numerical investigations are carried out for several inference optimization techniques such as weight quantization with a goal towards improving inference time. Finally, the trained model is then ported to an embedded platform (Nvidia Jetson TX1) and inference times are quantified when built-in optimizations in Nvidia’s TensorRT inference engine are enabled.

2 Related Work

In this section, work related to semantic segmentation, its application in the field of self-driving, and its enablers (datasets), along with the corresponding challenges, is outlined in three subcategories.

2.1 Deep semantic segmentation

Semantic segmentation, which was viewed as a challenging problem in computer vision until a few years ago, has witnessed rapid progress recently with deep learning. One of the seminal works in this area that brought focus on end-to-end learning of pixel-wise classification is the Fully Convolutional Network (FCN) architecture [15], which does not have the fully-connected layers at the end that are typically used for classification, but instead employs convolutional layers to classify each pixel in the image. The key insight of this work is that the network first learns feature maps whose height and width dimensions are reduced by striding and pooling operations; these are then upsampled within the network using transpose convolution (or deconvolution), so that the dimensions of the output match those of the original input image, yielding dense predictions.
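The spatial bookkeeping behind this downsample-then-upsample design can be sketched with the standard output-size formulas. The kernel/stride values below are illustrative; the padding of 4 for the final FCN-8s layer is an assumption inferred so that the dimensions match Table 2 (32 x 64 input, 16 x 16 kernel, stride 8, 256 x 512 output).

```python
def conv_out_size(n, k, s, p):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * p - k) // s + 1

def conv_transpose_out_size(n, k, s, p):
    """Spatial output size of a transposed convolution (the inverse mapping)."""
    return (n - 1) * s - 2 * p + k
```

For example, a 2x2 max-pool with stride 2 halves each dimension, and a 16x16 transposed convolution with stride 8 and padding 4 maps a 32 x 64 feature map back to 256 x 512.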

One of the principal limitations of this approach, however, is the impact of the loss of resolution on the final prediction, as the architecture relies on first downsampling the image into feature maps. This is addressed in [18], where a deeper transpose convolution network, with stacked deconvolution and unpooling layers, was employed to achieve performance gains, since the deconvolution network of [15] is overly simple and its input too coarse. In SegNet [2], a similar approach with an encoder-decoder architecture is used to address the loss of detailed object structure due to a coarse feature map; the decoder network, however, uses the max-pooling indices from the corresponding encoder layer to perform upsampling.

The issue of multi-scale semantics is the focus of [20] and [24]. Networks that work with a fixed-size receptive field can only handle single-scale semantics; i.e., if an object is substantially larger or smaller than the receptive field, it is either fragmented or mislabeled. Building upon the idea of the skip architecture proposed in [15] to merge feature maps from different resolutions, U-Net [20] develops a U-shaped encoder-decoder network in which feature maps from the initial layers are upsampled and added to those of later layers. Another work, [24], introduced dilated convolutions to aggressively increase the receptive field of the kernel without introducing parameters or subsampling, which provided a better solution for handling multiple scales.

2.2 Semantic segmentation in self-driving

Turning to the application of semantic segmentation for scene understanding in self-driving systems, the need arises to reduce inference latency and hence the computation required. One approach is to design computationally efficient architectures such as SqueezeNet [12], which demonstrated that it is possible to reproduce the image classification accuracy of AlexNet [13] with 50x fewer parameters by using a more efficient architecture. ENet [19] also presented a more efficient architecture based on convolutional layer factorization: each n x n convolution is decomposed into two smaller ones following each other, one with an n x 1 filter and the other with a 1 x n filter, which allows for large speedups and greatly reduces the number of parameters, making them less redundant.
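The parameter savings of such a factorization can be illustrated with simple counting; the kernel size (5) and channel count (128) below are hypothetical values, not taken from the paper:

```python
def conv_params(kh, kw, c_in, c_out):
    """Parameter count of a convolution layer (weights plus biases)."""
    return (kh * kw * c_in + 1) * c_out

def factorized_conv_params(n, c):
    """An n x 1 convolution followed by a 1 x n convolution,
    keeping the channel count fixed, as in ENet-style factorization."""
    return conv_params(n, 1, c, c) + conv_params(1, n, c, c)
```

For a 5x5 layer with 128 channels, the factorized pair uses roughly 2n/n^2 = 2/5 of the weights of the full kernel.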

Another line of research focuses on increasing the efficiency of existing networks by deriving smaller networks from larger counterparts [11], or by pruning or quantizing weights [10]. Another trend in the industry is to tweak the network for execution on specific hardware design or implement them using platform specific libraries such as TensorRT that optimizes deep learning models for inference and creates a runtime for deployment on specific hardware platforms.

2.3 Datasets for street scene understanding

A major contributing factor to the progress of deep learning, especially on the problem of image classification, is the availability of large-scale, publicly available datasets such as ImageNet [7]. Similarly, research progress in the application of semantic segmentation to street scene understanding for self-driving can be attributed to the existence of datasets such as the KITTI Vision benchmark suite [9] and CamVid [5]. However, these datasets are relatively small and do not fully capture the variability and complexity of real-world scenarios. Cityscapes is a high-quality dataset for semantic street scene understanding, with labeled examples of actual road scene images from 50 German cities collected in different weather conditions, and is therefore tailored for autonomous driving in an urban environment [6]. A more recent effort to build a much larger dataset resulted in the Mapillary Vistas dataset [17], with 25,000 images and 66 object classes.

Despite the significant progress, the best inference rates for the semantic segmentation task on embedded systems are still only on the order of a few frames per second [19], [23], which is clearly not acceptable as a viable commercial solution. Hence, there is a need to build upon these ideas and explore methods to reduce inference time.

3 Architecture

The FCN architecture, which still serves as a blueprint for most segmentation architectures, is employed in this study [15]. The network is composed of two parts: an encoder and a decoder. The encoder corresponds to a feature extractor that transforms the input image into a multidimensional feature representation, whereas the decoder derives the semantic segmentation map from the features extracted by the encoder network.

3.1 Encoder

Figure 2: Network Architecture: FCN-32, FCN-16, FCN-8

The encoder is a modified VGG16 architecture [22], which was initially designed for the task of image classification, has been shown to generalize well to other datasets, and is a popular encoder choice for segmentation tasks as well. The main contribution of this architecture is its use of very small (3x3) convolution filters. It demonstrated that replacing large kernel-sized filters with multiple 3x3 filters stacked one after another, for a given receptive field (the effective area of the input image on which an output depends), enables the network to learn more complex features at a lower computational cost.
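The receptive-field and parameter arithmetic behind this claim can be checked directly; the channel count of 256 below is an illustrative assumption:

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

def stacked_weights(kernel_sizes, c):
    """Weight count (ignoring biases) for a stack of k x k convolutions
    with c input and c output channels each."""
    return sum(k * k * c * c for k in kernel_sizes)
```

Three stacked 3x3 convolutions cover the same 7x7 receptive field as a single 7x7 convolution while using 27c^2 rather than 49c^2 weights, in addition to interleaving extra non-linearities.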

This network is modified by replacing the three fully connected layers at the end with three 1x1 convolution layers with 4096, 4096, and 35 (the number of classes in the dataset) filters respectively.

Layer           Activations      Parameters
Input           256 x 512 x 3    0
Conv: 3x3-64    256 x 512 x 64   (3x3x3 + 1) x 64
Conv: 3x3-64    256 x 512 x 64   (3x3x64 + 1) x 64
Pool            128 x 256 x 64   0
Conv: 3x3-128   128 x 256 x 128  (3x3x64 + 1) x 128
Conv: 3x3-128   128 x 256 x 128  (3x3x128 + 1) x 128
Pool            64 x 128 x 128   0
Conv: 3x3-256   64 x 128 x 256   (3x3x128 + 1) x 256
Conv: 3x3-256   64 x 128 x 256   (3x3x256 + 1) x 256
Conv: 3x3-256   64 x 128 x 256   (3x3x256 + 1) x 256
Pool            32 x 64 x 256    0
Conv: 3x3-512   32 x 64 x 512    (3x3x256 + 1) x 512
Conv: 3x3-512   32 x 64 x 512    (3x3x512 + 1) x 512
Conv: 3x3-512   32 x 64 x 512    (3x3x512 + 1) x 512
Pool            16 x 32 x 512    0
Conv: 3x3-512   16 x 32 x 512    (3x3x512 + 1) x 512
Conv: 3x3-512   16 x 32 x 512    (3x3x512 + 1) x 512
Conv: 3x3-512   16 x 32 x 512    (3x3x512 + 1) x 512
Pool            8 x 16 x 512     0
Conv: 1x1-4096  8 x 16 x 4096    (1x1x512 + 1) x 4096
Conv: 1x1-4096  8 x 16 x 4096    (4096 + 1) x 4096
Conv: 1x1-35    8 x 16 x 35      (4096 + 1) x 35
Total Memory    ~154 MB          ~128 MB
Table 1: Encoder Memory Estimates: Activations and Parameters (per-layer dimensions reconstructed for the 256 x 512 x 3 input of Section 4.1)

Input from Encoder
(8 x 16 x 35)
Activations Parameters
up4: conv-transpose 16 x 32 x 512 (4x4x35 + 1) x 512
skip4: add (16 x 32 x 512) x 2 0
up3: conv-transpose 32 x 64 x 256 (4 x 4 x 512 + 1) x 256
skip3 : add (32 x 64 x 256) x 2 0
output:conv-transpose 256 x 512 x 35 (16 x 16 x 256 + 1) x 35
Total Memory  17.5 MB  8.75 MB
Table 2: Decoder(FCN-8s) Memory Estimates : Activations and Parameters

3.2 Decoder

The task of semantic segmentation can be interpreted as understanding “what” class a pixel belongs to as well as “where” the pixel is in the original image[15]. The challenge, however, is that the semantic information resides in the deeper layers, which are coarser in resolution, while the location information resides in the shallower layers, which are finer in resolution. Therefore, to improve dense prediction, coarse, semantic information from deeper layers is combined with finer, appearance information and the way in which they are fused together results in different decoder architectures. In particular, three different networks are used in this study - FCN-32s, FCN-16s, and FCN-8s, whose architectures are different in only the decoder portion of the network as shown in Figure 3.

FCN-32s: Decoder with just one upsampling step of stride 32 for the final layer to recover the predictions for every pixel in the original image.

FCN-16s: Decoder which combines predictions from both the final layer and the pool4 layer, at stride 16, that results in predictions with finer details, while retaining high-level semantic information.

FCN-8s: Decoder which further combines additional predictions from pool3, at stride 8. This provides further precision compared to both FCN-32 and FCN-16 decoder networks.

Memory estimates for the activations and parameters of all layers in the encoder and decoder (FCN-8s) are shown in Table 1 and Table 2 respectively. It can be observed that the shallower convolutional layers take up most of the memory for activations, while the deeper layers take up most of the memory for parameters. It is also evident that the resources consumed by the decoder network are meager relative to the encoder network. Thus, employing a deeper or more complex decoder network might be a good option to improve the accuracy of dense predictions, as demonstrated in [18]. These estimates are also used to identify a batch size such that the model fits on the computing system during training, which is discussed in the implementation section.
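These estimates can be reproduced with a short script. This is a sketch under stated assumptions: a 256 x 512 x 3 input (Section 4.1), float32 storage, 3x3 "same" convolutions, 2x2/stride-2 pools, and the three 1x1 head convolutions of Table 1; layer names and the configuration tuple are illustrative, not the paper's code.

```python
BYTES_PER_FLOAT = 4  # float32

def encoder_memory(h, w,
                   cfg=(64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
                        512, 512, 512, 'M', 512, 512, 512, 'M'),
                   heads=(4096, 4096, 35)):
    """Activation and parameter memory (in bytes) of the VGG16-based encoder.

    cfg mirrors Table 1: integers are 3x3 'same' convolution widths and 'M'
    marks a 2x2/stride-2 max-pool; heads are the three 1x1 convolutions
    replacing VGG16's fully connected layers.
    """
    c_in, acts, params = 3, 0, 0
    for v in cfg:
        if v == 'M':
            h, w = h // 2, w // 2
            acts += h * w * c_in                # pooled feature map
        else:
            acts += h * w * v                   # convolution output map
            params += (3 * 3 * c_in + 1) * v    # weights + biases
            c_in = v
    for v in heads:
        acts += h * w * v
        params += (1 * 1 * c_in + 1) * v
        c_in = v
    return acts * BYTES_PER_FLOAT, params * BYTES_PER_FLOAT
```

For a 256 x 512 input, this yields roughly 154 MB of activations and 128 MB of parameters, matching the totals in Table 1.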

Figure 3: Decoder Architecture: FCN-32s, FCN-16s, FCN-8s

4 Implementation

4.1 Dataset

The dataset used in this study is the Cityscapes dataset [6]. It has 5000 densely annotated images, split into training, validation, and testing sets of 2975, 500, and 1525 images respectively. The annotations cover the classes and groups shown in Table 3 for pixel-level segmentation; only the finely annotated images, without any additional training data or augmentations, are used in this evaluation.

Group Classes
flat road, sidewalk,
parking, rail track
human person, rider
vehicle car, truck,
bus, on rails,
motorcycle, bicycle,
caravan, trailer
construction building, wall,
fence, guard rail, bridge, tunnel
object pole, pole group,
traffic sign, traffic light
nature vegetation, terrain
sky sky
void ground, dynamic, static
  • Will be labeled as group if the boundary between such instances cannot be clearly seen.

  • This label is not included in any evaluation and treated as void.

Table 3: Cityscapes Dataset: Groups and Classes

The images in the original dataset have a resolution of 2048 x 1024. Using full-size images is beyond the compute capability of the systems used, as demonstrated by the memory requirements in Table 1. Hence, the original images are scaled down to 256 x 512 for training/validation/testing, and predictions are scaled back up using INTER_NEAREST interpolation in OpenCV [4]. Code from [6] was used to pre-process the data to use all labeled classes, generate ground truth images for the updated labels, and perform validation and metric calculation.
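Nearest-neighbor interpolation matters here because label maps hold discrete class ids: averaging interpolations would invent ids that belong to no class. A minimal NumPy equivalent of this index-mapping resize (a sketch, not the OpenCV implementation) is:

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbor resize via index mapping; unlike bilinear
    interpolation, it never produces new values, so it is safe for
    integer label maps."""
    in_h, in_w = img.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return img[rows[:, None], cols]
```

Downscaling a label map with this function and scaling it back up preserves the set of class ids, which is the property needed for the ground-truth images.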

4.2 Hardware and Software

Device Processor RAM NVIDIA GPU VRAM Compute Capability
Desktop i7-3770K 8 cores 32 GB 2 x GTX1080s 16 GB 6.1
Laptop i7-7700HQ 4 cores 32 GB GTX 1050 4 GB 6.1
Jetson TX1 Quad ARM A57 4 cores 4 GB shared
NVIDIA Maxwell
256 CUDA cores
4 GB 5.3
Table 4: Hardware Setup

The hardware setup across the devices is described in Table 4. The desktop and laptop have x86_64 Intel architectures, while the Jetson TX1 has an ARM64 processor. The software setup is kept as consistent as possible across all devices to achieve an accurate comparison of their inference time and power parameters. All devices run an Ubuntu 16.04 Linux distribution, with the exception that the OS for the Jetson TX1 is optimized and packaged within the JetPack 3.2 package as L4T from NVIDIA. TensorFlow 1.8.0 is installed on all devices; it is available as a prepackaged wheel for x86_64 architectures but had to be compiled from source to run on the ARM64 processor of the Jetson TX1. CUDA 9.0 libraries and cuDNN are installed across all devices. The desktop and laptop builds have complete installations of TensorRT, while JetPack 3.2 currently supports TensorRT 3.0 RC without Python API support. All other dependencies are met using either prepackaged installers or compilation from source for the ARM architecture, which proved to be a cumbersome task on the weaker ARM processor.

4.3 Training

4.3.1 Learning Rate

A parameter search over different learning rates was performed for about 10 epochs on the FCN-8s network; an initial learning rate of 0.0001, reduced to 0.00001 after about 25 epochs, works best. No such study was performed for the FCN-16s and FCN-32s networks; however, their learning rates were increased by factors of 10 and 100 respectively, which seemed to work well.

4.3.2 Batch Size

Based on the memory calculations shown in Table 1 and Table 2, the memory consumption for each image is estimated as the forward- and backward-pass activation memory plus the parameter storage, tripled to account for Adam optimization's two moment buffers. The batch size is then chosen so that this total fits within the memory of the system used for training.
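The estimation procedure can be sketched as follows; the doubling of activation memory for the backward pass and the tripling of parameter memory for Adam are common rules of thumb, and the 8 GB GPU size in the usage example is hypothetical, not a figure from the paper:

```python
def max_batch_size(gpu_mem_mb, act_mb, param_mb):
    """Crude upper bound on batch size for training with Adam.

    Assumes per-image activation memory roughly doubles for the backward
    pass, while the parameters (shared across the batch) are tripled to
    hold Adam's first- and second-moment buffers."""
    fixed = 3 * param_mb        # weights + Adam m and v slots
    per_image = 2 * act_mb      # forward + backward activations
    return int((gpu_mem_mb - fixed) // per_image)
```

With the Table 1 and Table 2 totals (154 + 17.5 MB of activations, 128 + 8.75 MB of parameters) and a hypothetical 8 GB GPU, this bound works out to a batch size of about 22.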

4.3.3 Cross Entropy Loss (CE Loss)

Cross-entropy loss is given by the following equation [3]:

CE = -(1/N) * sum_{i=1}^{N} log f_{y_i}(x_i)

where N is the number of pixels in the image or batch considered, y_i is the ground-truth class of pixel i, f_{y_i}(x_i) is the network's probability estimate for the ground-truth class of pixel i, and f(x_i) is the vector of all network outputs for pixel i, obtained by mapping the unnormalized scores of the network through a softmax function.
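Under these definitions, the loss can be written in a few lines of NumPy (a minimal sketch with illustrative shapes and names, not the paper's TensorFlow code):

```python
import numpy as np

def softmax_cross_entropy(scores, labels):
    """Mean pixel-wise cross-entropy from unnormalized scores.

    scores: (N, C) array of unnormalized class scores, one row per pixel;
    labels: (N,) array of ground-truth class ids."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

Uniform scores over C classes give a loss of log C, and a confident correct prediction drives the loss towards zero, matching the definition above.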


The three FCN networks (FCN-8s, FCN-16s, and FCN-32s) are implemented using TensorFlow. The encoder network is initialized with pre-trained VGG16 weights provided by Udacity, and each network is trained end-to-end, including the encoder, using Adam optimization to minimize the cross-entropy loss.

5 Experiments

5.1 Metric

As the performance measure, the commonly used intersection-over-union (IoU) metric is used, evaluated for individual classes and categories. It is the standard Jaccard Index, commonly known as the PASCAL VOC intersection-over-union metric [8]:

IoU = TP / (TP + FP + FN)

where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively, determined over the whole test set.
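The metric can be sketched directly from this definition (a minimal illustration, not the official Cityscapes evaluation script):

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Per-class IoU = TP / (TP + FP + FN) over flattened label maps;
    a class absent from both prediction and ground truth yields NaN."""
    pred, gt = np.ravel(pred), np.ravel(gt)
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float('nan'))
    return ious
```

The mean over classes (ignoring NaNs) gives the per-class mean IoU reported in Tables 6 and 7.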

Table 5: Training Characteristics (cross-entropy loss and mean IOU on the training set as a function of training epochs; plots not reproduced here)

5.2 Segmentation Performance

The segmentation performance is evaluated on the validation set using the official Cityscapes evaluation script. Per-class mean IOUs of 0.428, 0.461, and 0.241, and per-category mean IOUs of 0.696, 0.709, and 0.533 are obtained for FCN-8s, FCN-16s, and FCN-32s respectively. Detailed per-class results are presented in Table 6 and per-category results in Table 7.

The qualitative results of the network are presented in Table 9. It can be noticed that the network tends to fail in labeling too-large or too-small objects, due to its fixed-size receptive field [18]. This trend can also be seen in the poor per-class IOU performance on, for example, the pole, traffic light, and motorcycle classes.

The learning curves for all the networks, as well as the mean IOU on the training set as a function of training epochs, are shown in Table 5. Due to the time constraints of the project, training had to be stopped early for all the networks, before they had fully attained their capacity. This explains the slightly better performance of FCN-16s compared to FCN-8s, and the lack of finer object structures in the FCN-32s outputs relative to the other two networks. For example, in Image4 in Table 9, traffic signs and traffic lights are completely missing in FCN-32s. Also, compared with the benchmark data for the Cityscapes test set shown in Table 8, for two networks designed with the goal of reducing inference time, the segmentation results obtained in this study are noticeably lower, which can likewise be understood given that training was stopped prematurely.

classes FCN-8s FCN-16s FCN-32s
road 0.94 0.942 0.918
sidewalk 0.635 0.657 0.505
building 0.809 0.811 0.726
wall 0.238 0.209 0.0
fence 0.195 0.201 0.0
pole 0.171 0.213 0.0
traffic light 0.111 0.159 0.0
traffic sign 0.295 0.356 0.0
vegetation 0.841 0.836 0.738
terrain 0.453 0.447 0.233
sky 0.870 0.865 0.770
person 0.464 0.486 0.016
rider 0.035 0.166 0.0
car 0.832 0.837 0.683
truck 0.203 0.330 0.0
bus 0.360 0.453 0.0
train 0.177 0.221 0.0
motorcycle 0.061 0.120 0.0
bicycle 0.452 0.443 0.0
Score Average 0.428 0.461 0.241
Table 6: Class IOU
Category FCN-8s FCN-16s FCN-32s
construction 0.792 0.797 0.702
flat 0.935 0.936 0.908
human 0.457 0.484 0.014
nature 0.835 0.833 0.726
object 0.206 0.258 0.0
vehicle 0.779 0.790 0.610
sky 0.870 0.865 0.770
Score Average 0.696 0.709 0.533
Table 7: Category IOU
class IOU Category IOU
based network[23] 0.598 0.843
ENet [19] 0.583 0.804
Table 8: Benchmarks - mean IOU for Cityscapes test set
Table 9: Semantic Segmentation Inference Maps (ground truth and network predictions for four sample images; images not reproduced here)

5.3 Effect of built-in TensorFlow optimizations on model size and inference time

Optimization Parameter Model Size in MB
FCN8s FCN16s FCN32s
frozen model no optimizations 153.715 154.497 140.013
add_default_attributes 153.715 154.497 140.013
fold_constants(ignore_errors=true) 153.715 154.497 140.013
fold_batch_norms 153.715 154.497 140.013
fold_old_batch_norms 153.715 154.497 140.013
fuse_resize_and_conv 153.715 154.497 140.013
quantize_weights 38.481 38.673 35.048
strip_unused_nodes 153.715 154.497 140.013
sort_by_execution_order 153.715 154.497 140.013
remove_nodes(op=Identity, op=CheckNumerics) 153.709 154.493 140.008
merge_duplicate_nodes 153.710 154.494 140.010
All Optimizations 38.467 38.670 35.039
Table 10: Optimization Parameters and Network Size
Time (ms) =>
Desktop Laptop Jetson TX1
FCN8s FCN16s FCN32s FCN8s FCN16s FCN32s FCN8s FCN16s FCN32s
Baseline 26 24 23 75 69 64 - - -
Weight Quantized 60 60 56 95 90 82 760 772 714
Table 11: Effect of TensorFlow optimizations on inference time (ms)

A promising approach to reducing inference time and DRAM footprint (and hence power consumption) is model compression. A compressed model that fits into on-chip SRAM cache rather than off-chip DRAM facilitates the deployment of deep networks in self-driving cars, where memory size, inference speed, and network bandwidth are all strictly constrained. Such compression would enable a fully trained network to be loaded into the SRAM of an embedded processor inside a driverless car, providing on-chip, in-memory inference at low power [10]. Therefore, the effect of the optimization techniques in the Graph Transform tool in TensorFlow [1] on model size as well as inference time is quantified in this study.

The first step in deploying a trained network is to freeze it, i.e., fuse the information stored in the graph definition and checkpoint files by fixing the weights of the network and removing training-only information such as optimizer options, gradients, etc. During training, weights are not stored in graph definitions, as they are constantly tuned, and are hence stored in separate checkpoint files; freezing removes the overhead incurred in fetching the latest variable values from separate files.

Once the network is frozen, the TensorFlow-provided Graph Transform tool can be used to perform optimizations on the saved GraphDef files (.pb). The tool supports a variety of transforms that can be applied to a network to optimize its size. The optimizations suggested in TensorFlow's documentation for deployment include stripping unnecessary and unused nodes, folding constants and batch norms, and quantizing weights. Table 10 shows the various optimizations performed on the graph model and their effect on model size. It is evident that the model size remains essentially the same, to within a few bytes, for the majority of the optimizations, except for weight quantization, which converts large floating-point constant ops into 8-bit equivalents and reduces the model size to about a quarter of the original.
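A typical invocation of the tool looks like the following sketch; the graph file names and the input/output tensor names (`image_input`, `logits`) are hypothetical placeholders, while the transform names are those listed in Table 10:

```shell
bazel build tensorflow/tools/graph_transforms:transform_graph
bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
  --in_graph=frozen_fcn8s.pb \
  --out_graph=optimized_fcn8s.pb \
  --inputs='image_input' \
  --outputs='logits' \
  --transforms='
    strip_unused_nodes
    fold_constants(ignore_errors=true)
    fold_batch_norms
    fold_old_batch_norms
    quantize_weights
    sort_by_execution_order'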

Table 11 gives the impact of the optimizations on the inference time of the graph. As expected, the optimizations that did not change the model size also did not cause any significant change in inference time, and are therefore grouped together with the baseline inference time. Weight quantization, on the other hand, has a drastic effect on inference time. Interestingly, it did not reduce inference times but rather increased them, by roughly 1.3x to 2.3x depending on the platform (Table 11). This might be due to the additional operations needed to work with quantized weights, or to the lack of system-level drivers that leverage the memory optimizations mentioned in [10].

The effect of the underlying hardware platform on inference time is also quantified in Table 11. As expected, the inference time varies inversely with the compute capability of the hardware. (The NVIDIA Jetson TX1 has 256 CUDA cores and shared RAM; the laptop has a GTX 1050 GPU with 4 GB VRAM and 768 CUDA cores, three times more than the TX1; and the desktop has two NVIDIA GTX 1080s with a total of 16 GB VRAM and 5120 CUDA cores (2560 x 2), about 20 times more than the TX1 and about 7 times more than the laptop.) The measured inference-time ratios on the weight-quantized models, roughly 13 : 8 : 1 for desktop : laptop : TX1, are reasonably close, at least in order of magnitude, to this estimate. When the baseline model is run on the TX1, an OOM (out of memory) error occurs and the process is killed; hence baseline results are not available. The weight-quantized model, however, runs on the TX1 with no problems, which underlines the need for such optimizations on embedded platforms.

5.4 Effect of built-in TensorRT optimizations on inference time

TensorRT is NVIDIA's high-performance inference library, employing optimizations and calibrations on neural networks to obtain optimal performance on GPUs designed by NVIDIA. The effect of TensorRT optimizations on the inference times of the network is quantified in this study, as the testing platforms are all based on NVIDIA GPUs.

TensorFlow graphs are exported for use with other backends using the Universal Framework Format (UFF); when a UFF graph is parsed into a TensorRT engine, the following four automatic optimizations are performed. Layer and tensor fusion reduces the number of layers by recognizing and fusing layers that have the same input data and filter size, and CUDA kernels are fused to perform sequential operations together, overcoming the latency introduced by multiple kernel launches. Precision calibration allows choosing the inference precision among FP32, FP16, or INT8 without retraining the network. Kernel auto-tuning chooses an optimized kernel from a wide range of options to best suit the target GPU, input data, batch size, tensor layout, and other such parameters. Dynamic tensor memory ensures that memory is reused by designating memory for a tensor only while it is being used, which prevents memory allocation overhead. These optimizations should bring a significant reduction in the inference time of the networks.

The Bonnet framework [16] provides a C++ API for TensorRT, which is used to run the TensorRT optimizations on the three platforms; their effect on inference time is documented in Table 12. As expected, the TensorRT optimizations showed a significant reduction in inference times: roughly 65% on the desktop (26 ms to 9 ms) and roughly 55% on the laptop (75 ms to 34 ms).

Device Baseline (ms) TensorRT optimized (ms)
Desktop 26 9
Laptop 75 34
Jetson TX1 - 460
Table 12: Effect of TensorRT optimizations on inference time (ms)

5.5 Comparison of performance and power metrics across hardware platforms

Given that the three hardware platforms have different compute capabilities as well as power consumption, the inference times should be normalized by the power consumed in order to quantify and compare them across platforms.

The power consumption data for the desktop and laptop devices are measured using the NVIDIA System Management Interface available as a command line utility with relevant parameters as shown below.
nvidia-smi daemon -i 0 -s p -d 5 -p /data/logs
The power reading provided by the tool is measured in watts (W) for each GPU and is accurate to +/- 5 W.

Measuring power on the Jetson TX1 is not as straightforward, since the custom graphics driver is not bundled with SMI. The TX1 has INA monitors to measure the current and voltage being drawn, which are available to the processor through an I2C interface. The TX1 has a three-channel monitor that provides the input current (mA), voltage (mV), and power (mW) at I2C address 0x40. The commands required to obtain these readings are:

cd /sys/devices/platform/7000c400.i2c/i2c-1/
cat 1-0040/iio_device/in_current0_input
cat 1-0040/iio_device/in_voltage0_input
cat 1-0040/iio_device/in_power0_input

The output of the cat commands can be redirected to a file for processing. Table 13 shows the average power consumed by each of the hardware platforms. Energy consumption E (W-hr) is the energy consumed by the platform to run inference over the test images, computed from the inference time (ms) and the average power consumed (P).
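The computation is simple unit conversion; in the usage example below, the 500-image count is an assumption for illustration (the Cityscapes validation set has 500 images; the source does not state the count used):

```python
def energy_wh(avg_power_w, latency_ms, n_images):
    """Energy in watt-hours to run n_images inferences at the given
    average power draw (W) and per-image latency (ms)."""
    seconds = latency_ms * n_images / 1000.0
    return avg_power_w * seconds / 3600.0
```

For example, at the desktop's 35.27 W average draw and 26 ms baseline latency (Tables 11 and 13), 500 inferences would cost energy_wh(35.27, 26, 500) ≈ 0.127 W-hr, of the same order as the 0.134 W-hr reported in Table 13.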

The relative performance of the network on the three platforms, when measured purely in terms of inference time, i.e., how many images can be inferred per second, is roughly 51 : 14 : 1 for desktop : laptop : TX1 (inversely proportional to the TensorRT-optimized inference times in Table 12). However, when performance is compared in terms of energy consumed, i.e., how many images can be inferred per W-hr, the ratio narrows to roughly 6 : 2.5 : 1 (from Table 13), outlining the power efficiency of the embedded system.

Device Power Energy
Consumption(W) Consumption(W-hr)
Desktop 35.27 0.134
Laptop 22.65 0.326
Jetson TX1 4.16 0.810
Table 13: Average Power consumption for inference

6 Conclusion and Future Work

In this project, the task of pixel-wise semantic segmentation in the context of self-driving, with the goal of reducing inference time, is explored. FCN-based networks with a VGG16 encoder architecture and skip connections are trained on the Cityscapes dataset. On the validation set, the trained networks scored per-class mean IOUs of 0.428, 0.461, and 0.241 and per-category mean IOUs of 0.696, 0.709, and 0.533 for the FCN-8s, FCN-16s, and FCN-32s networks respectively. Several network optimizations built into TensorFlow and TensorRT, and their impact on inference times as well as model size, are quantified. Finally, the trained network is ported onto the Jetson TX1, and inference times across the hardware platforms are compared and presented.

This work could be extended in several ways. Although the inference times across hardware platforms are compared in this study, the corresponding IOU scores on the validation/test sets under each optimization are not obtained, which would be needed to fully understand the accuracy/latency tradeoff. Networks based on more efficient architectures such as SqueezeNet [12], coupled with these optimizations, could also be investigated to quantify their performance on embedded platforms. Finally, optimization techniques that require retraining, such as pruning, are not considered in this experiment and could be explored as well.


  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015)

    TensorFlow: large-scale machine learning on heterogeneous systems

    Note: Software available from External Links: Link Cited by: §5.3.
  • [2] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017-12) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (12), pp. 2481–2495. External Links: Document, ISSN 0162-8828 Cited by: §2.1.
  • [3] M. Berman, A. R. Triki, and M. B. Blaschko (2018) The lovász-softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In

    Conference on Computer Vision and Pattern Recognition

    Cited by: §4.3.3.
  • [4] G. Bradski (2000) The OpenCV Library. Dr. Dobb’s Journal of Software Tools. Cited by: §4.1.
  • [5] G. J. Brostow, J. Fauqueur, and R. Cipolla (2009) Semantic object classes in video: a high-definition ground truth database. Pattern Recognition Letters 30 (2), pp. 88–97. Cited by: §2.3.
  • [6] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The Cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.3, §4.1.
  • [7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR09, Cited by: §2.3.
  • [8] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman (2015) The PASCAL visual object classes challenge: a retrospective. IJCV 111 (1), pp. 98–136. Cited by: §5.1.
  • [9] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the KITTI dataset. Int. J. Rob. Res. 32 (11), pp. 1231–1237. Cited by: §2.3.
  • [10] S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. CoRR abs/1510.00149. Cited by: §2.2, §5.3.
  • [11] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. CoRR abs/1503.02531. Cited by: §2.2.
  • [12] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR abs/1602.07360. Cited by: §2.2, §6.
  • [13] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105. Cited by: §2.2.
  • [14] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature. Cited by: §1.
  • [15] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1, §3, §3.2.
  • [16] A. Milioto and C. Stachniss (2019) Bonnet: an open-source training and deployment framework for semantic segmentation in robotics using CNNs. In Proc. of the IEEE Intl. Conf. on Robotics & Automation (ICRA), Cited by: §5.4.
  • [17] G. Neuhold, T. Ollmann, S. Rota Bulò, and P. Kontschieder (2017) The Mapillary Vistas dataset for semantic understanding of street scenes. In International Conference on Computer Vision (ICCV), Cited by: §2.3.
  • [18] H. Noh, S. Hong, and B. Han (2015) Learning deconvolution network for semantic segmentation. CoRR abs/1505.04366. Cited by: §2.1, §3.2, §5.2.
  • [19] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello (2016) ENet: a deep neural network architecture for real-time semantic segmentation. CoRR abs/1606.02147. Cited by: §2.2, §2.3, Table 8.
  • [20] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. CoRR abs/1505.04597. Cited by: §2.1.
  • [21] M. Siam, S. Elkerdawy, M. Jagersand, and S. Yogamani (2017) Deep semantic segmentation for automated driving: taxonomy, roadmap and challenges. In IEEE Intelligent Transportation Systems Conference, Cited by: §2.1.
  • [22] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556. Cited by: §3.1.
  • [23] M. Treml, J. A. Arjona-Medina, T. Unterthiner, R. Durgesh, F. Friedmann, P. Schuberth, A. Mayr, M. Heusel, M. Hofmarcher, M. Widrich, B. Nessler, and S. Hochreiter (2016) Speeding up semantic segmentation for autonomous driving. Cited by: §1, §2.3, Table 8.
  • [24] F. Yu and V. Koltun (2015) Multi-scale context aggregation by dilated convolutions. CoRR abs/1511.07122. Cited by: §2.1.