
Fast Scene Understanding for Autonomous Driving

by Davy Neven, et al.

Most approaches for instance-aware semantic labeling traditionally focus on accuracy. Other aspects like runtime and memory footprint are arguably as important for real-time applications such as autonomous driving. Motivated by this observation and inspired by recent works that tackle multiple tasks with a single integrated architecture, in this paper we present a real-time efficient implementation based on ENet that solves three autonomous driving related tasks at once: semantic scene segmentation, instance segmentation and monocular depth estimation. Our approach builds upon a branched ENet architecture with a shared encoder but different decoder branches for each of the three tasks. The presented method can run at 21 fps at a resolution of 1024x512 on the Cityscapes dataset without sacrificing accuracy compared to running each task separately.



I Introduction

In recent years, the resurgence of Convolutional Neural Networks (CNNs), whose origins trace back to the 1970s and 1980s, has led to significant advances in many computer vision tasks, such as image classification [14], object detection [8], semantic scene segmentation [16], instance segmentation [9], and monocular depth estimation [6], to name a few. The majority of these works rely on fine-tuning or slightly altering a CNN architecture, typically the VGG network [19], resulting in task-specific CNNs with long inference times that each require a GPU of their own to run. This is not sufficient for autonomous driving applications, where many of the aforementioned tasks should run in parallel, in real time, and on a limited number of GPU devices. Furthermore, as shown in recent works [22, 20, 13], there is merit in combining multiple tasks in a single integrated architecture: one task might benefit from another, leaving less room for 'blind spots', which is crucial for self-driving vehicles.

Motivated by these observations, in this paper we focus on street scene understanding and present an efficient implementation that combines the tasks of semantic scene segmentation, instance segmentation, and monocular depth estimation. Unlike state-of-the-art methods that use networks with a huge number of parameters and long inference times (e.g. VGG [19], SegNet [2], FCN [16]), we build upon a real-time architecture, namely ENet [18], which has been shown to process images at more than 10 fps on a single GPU device. Specifically, we use a common ENet encoding step for all tasks, but introduce a branched ENet architecture for the decoding step (i.e. one branch for each of the three tasks). Fig. 1 gives an overview of our approach.

Although we do not introduce a new architecture, in this paper we show how to efficiently combine existing components to build a solid architecture for real-time scene understanding. In Sec. II we describe related work on integrated architectures that tackle multiple tasks. Next, we present the implementation details of our method in Sec. III. Finally, in Sec. IV and V we report results for each of the tasks and provide some insights into the strengths and limitations of the presented approach.

II Related Work

Fig. 1: Overview of our method. From left to right: the input image is passed through the encoding step of an ENet-inspired architecture to create feature maps, which in turn are forwarded to different branches that perform decoding to arrive at the three outputs, i.e. semantic labels, instance labels and depth. A video is available at

The literature on the three main tasks studied in this paper, i.e. semantic scene segmentation, instance segmentation, and monocular depth estimation, is vast. In what follows, we focus solely on related works that have combined one or more of these tasks in a single integrated architecture.

Eigen and Fergus [5] addressed the tasks of depth prediction, surface normal estimation, and semantic labeling using a multiscale convolutional network architecture that progressively refines predictions over a sequence of scales. Uhrig et al. [22] presented a method that leverages an FCN to predict semantic labels, depth, and an instance-based encoding that uses each pixel's direction towards its corresponding instance center, and subsequently applies low-level computer vision techniques. Kokkinos [13] went one step further and introduced a CNN, namely UberNet, that jointly handles low-, mid-, and high-level vision tasks in a unified architecture. His universal network tackles boundary detection, normal estimation, saliency estimation, semantic segmentation, human part segmentation, semantic boundary detection, region proposal generation, and object detection. Despite obtaining competitive performance while jointly addressing many different tasks, all these approaches suffer from long inference times, making them unsuitable for real-time autonomous driving applications with high frame-rate demands.

Recently, Teichmann et al. [20] argued that improving inference speed is more important than improving accuracy, especially for self-driving vehicles. They presented an approach to joint classification, detection, and semantic segmentation via a unified architecture in which the encoder is shared amongst the three tasks, barely reaching a frame rate of 10 fps on the KITTI dataset. Our approach also focuses on improving inference speed but addresses different tasks, in particular semantic scene segmentation, instance segmentation, and monocular depth estimation, and achieves a frame rate of 21 fps on the Cityscapes dataset. To our knowledge, this is the first system to estimate depth, semantic segmentation, and instance segmentation at these frame rates.

III Method

In order to predict depth, semantic and instance segmentation in real time, we modify the ENet architecture into a multi-branched network with three output branches, one for each task (see Fig. 1). The original network, as described in [18], consists of an encoding step that has three stages (stages 1, 2, 3) and a decoding step that has two stages (stages 4, 5). Since the ENet decoding step serves merely to upscale and fine-tune the output of the encoding step, sharing the full encoder (stages 1, 2, 3) between all branches would lead to poor results. Instead, our multi-branch network is constructed as follows: our shared "encoder" consists of stages 1 and 2 of the original ENet network, after which each branch combines stage 3 of the original ENet encoder with stages 4 and 5 of the original ENet decoder. In what follows, we dive into the details of the individual branches, each of which performs one task.
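The branching scheme described above can be sketched structurally as follows. This is only an illustration of how the stages are wired, not the actual network: the stage bodies are identity placeholders, and the names `make_stage` and `build_branched_net` are hypothetical. The point is that stages 1 and 2 run once per image, while stages 3, 4 and 5 run once per task.

```python
from typing import Callable, Dict, List

Features = List[float]
Stage = Callable[[Features], Features]

TASKS = ("semantic", "instance", "depth")


def make_stage(name: str, log: List[str]) -> Stage:
    """Return a placeholder stage that records its own execution.

    A real stage would be a stack of ENet bottleneck modules; here it is
    the identity so that the wiring can be inspected in isolation.
    """
    def stage(x: Features) -> Features:
        log.append(name)
        return x
    return stage


def build_branched_net(log: List[str]) -> Callable[[Features], Dict[str, Features]]:
    # Shared part: stages 1 and 2 of the original encoder.
    shared = [make_stage(f"shared/stage{k}", log) for k in (1, 2)]
    # Per-task part: encoder stage 3 followed by decoder stages 4 and 5.
    branches = {
        task: [make_stage(f"{task}/stage{k}", log) for k in (3, 4, 5)]
        for task in TASKS
    }

    def forward(image: Features) -> Dict[str, Features]:
        features = image
        for stage in shared:          # computed once for all tasks
            features = stage(features)
        outputs = {}
        for task, stages in branches.items():
            y = features              # every branch starts from the same features
            for stage in stages:
                y = stage(y)
            outputs[task] = y
        return outputs

    return forward
```

Running one forward pass executes 2 shared stages plus 3 x 3 branch stages, compared with 3 x 5 stages for three separate networks.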

Semantic segmentation The semantic segmentation branch is trained using the standard pixel-wise cross-entropy loss. The classes are weighted using the method described in [18], and the branch is trained until convergence. The semantic segmentation is used for free space detection as well as for classifying the objects found by the instance segmentation branch.
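As an illustration of class-weighted cross-entropy, the sketch below assumes that the weighting of [18] takes the form w_class = 1 / ln(c + p_class), with p_class the fraction of pixels belonging to the class and c = 1.02; this form and constant are stated here as an assumption about [18], not taken from the text above. The function names are hypothetical, and averaging over pixels (rather than over the sum of weights) is a further simplifying choice.

```python
import math
from typing import List, Sequence


def enet_class_weights(pixel_counts: Sequence[int], c: float = 1.02) -> List[float]:
    """Weight each class by 1 / ln(c + p_class); rare classes get larger weights.

    The form and the value c = 1.02 are assumptions about the scheme of [18].
    """
    total = sum(pixel_counts)
    return [1.0 / math.log(c + n / total) for n in pixel_counts]


def weighted_cross_entropy(probs: Sequence[Sequence[float]],
                           labels: Sequence[int],
                           weights: Sequence[float]) -> float:
    """Mean over pixels of -w[label] * log(p[label]).

    probs[i] is the predicted class distribution of pixel i,
    labels[i] its ground truth class index.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total += -weights[y] * math.log(p[y])
    return total / len(labels)
```

With this form the weights stay bounded between roughly 1.4 (a class covering the whole image) and 50.5 (a class that is almost absent), so a very rare class cannot dominate the loss.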

Instance segmentation In order to perform instance segmentation using a typical feed-forward network without having to resort to slower detect-and-segment approaches, we use a recently introduced discriminative loss function [4] suited for real-time instance segmentation that can be plugged into an off-the-shelf network. The intuition behind the proposed loss function is that pixel embeddings (i.e. the network's output for each pixel) with the same label (i.e. same instance) should end up close together, while embeddings with a different label (i.e. different instance) should end up far apart.

Inspired by Weinberger et al. [23] and other distance metric learning approaches, the authors propose a loss function with two competing terms to achieve this objective: a variance term $L_{var}$ pulling pixel embeddings towards the mean embedding of their cluster, and a distance term $L_{dist}$ pushing the clusters away from each other. To relax the constraints on the network, the variance and distance terms are hinged: embeddings within a distance of $\delta_v$ from their cluster center are no longer attracted to it, and cluster centers further apart than $2\delta_d$ are no longer repulsed. A small regularization pull-force $L_{reg}$ that draws all clusters towards the origin keeps the activations bounded. These three terms can be written as follows, with $C$ the number of clusters in the ground truth, $N_c$ the number of elements in cluster $c$, $x_i$ an embedding, $\mu_c$ the mean embedding of cluster $c$, $\|\cdot\|$ the L2 distance, and $[x]_+ = \max(0, x)$ the hinge:

$$L_{var} = \frac{1}{C} \sum_{c=1}^{C} \frac{1}{N_c} \sum_{i=1}^{N_c} \left[ \|\mu_c - x_i\| - \delta_v \right]_+^2$$

$$L_{dist} = \frac{1}{C(C-1)} \sum_{c_A=1}^{C} \sum_{\substack{c_B=1 \\ c_B \neq c_A}}^{C} \left[ 2\delta_d - \|\mu_{c_A} - \mu_{c_B}\| \right]_+^2$$

$$L_{reg} = \frac{1}{C} \sum_{c=1}^{C} \|\mu_c\|$$

The final loss can then be written as the sum of the above terms: $L = L_{var} + L_{dist} + L_{reg}$. When the loss has converged, all pixel embeddings are within a distance of $\delta_v$ from their cluster center and all cluster centers are at least $2\delta_d$ apart. By setting $\delta_d > 2\delta_v$, each embedding is closer to all embeddings of its own cluster than to any embedding of a different cluster. During inference we can then threshold with bandwidth $b = 2\delta_v$ around any embedding to select all embeddings belonging to the same cluster. Since the loss on the test set will not be zero, we apply a GPU-accelerated variant of the mean-shift algorithm [7] to shift to a center pixel around which we threshold, avoiding outliers.
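The variance, distance and regularization terms described above can be sketched in plain Python as follows. This is an unoptimized illustration for a single image, not the implementation used in the paper: the function names are hypothetical, the margin defaults of 0.5 and 1.5 are illustrative values chosen so that the distance margin exceeds twice the variance margin, and the simple thresholding helper stands in for the mean-shift step.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vec = Sequence[float]


def l2(a: Vec, b: Vec) -> float:
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def discriminative_loss(embeddings: Sequence[Vec], labels: Sequence[int],
                        delta_v: float = 0.5,
                        delta_d: float = 1.5) -> Tuple[float, float, float]:
    """Return (variance, distance, regularization) terms for one image."""
    clusters: Dict[int, List[Vec]] = defaultdict(list)
    for x, c in zip(embeddings, labels):
        clusters[c].append(x)
    # Mean embedding of every ground truth instance.
    means = {c: [sum(col) / len(xs) for col in zip(*xs)]
             for c, xs in clusters.items()}
    n_clusters = len(means)

    # Pull term: hinged squared distance of each embedding to its own mean.
    l_var = sum(
        sum(max(0.0, l2(means[c], x) - delta_v) ** 2 for x in xs) / len(xs)
        for c, xs in clusters.items()
    ) / n_clusters

    # Push term: hinged squared distance between every ordered pair of means.
    l_dist = 0.0
    if n_clusters > 1:
        ids = list(means)
        l_dist = sum(
            max(0.0, 2.0 * delta_d - l2(means[a], means[b])) ** 2
            for a in ids for b in ids if a != b
        ) / (n_clusters * (n_clusters - 1))

    # Regularization: mean distance of the cluster means to the origin.
    origin = [0.0] * len(embeddings[0])
    l_reg = sum(l2(m, origin) for m in means.values()) / n_clusters
    return l_var, l_dist, l_reg


def threshold_cluster(embeddings: Sequence[Vec], seed_index: int,
                      bandwidth: float) -> List[int]:
    """Indices of all embeddings closer than `bandwidth` to the seed embedding."""
    seed = embeddings[seed_index]
    return [i for i, x in enumerate(embeddings) if l2(x, seed) < bandwidth]
```

For two well-separated toy instances the variance and distance terms vanish, and thresholding around any embedding recovers exactly the pixels of its instance.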

Depth estimation from a single image The standard loss used in most regression problems, like monocular depth estimation, is the $L_2$ loss. It minimizes the squared difference between predicted and ground truth depth: $L_2(y) = \frac{1}{N} \sum_i y_i^2$, with $y_i = \hat{d}_i - d_i$ the residual between the predicted depth $\hat{d}_i$ and the ground truth depth $d_i$ at pixel $i$. Recently, Eigen and Fergus [5] added two more terms to the typical $L_2$ loss for the depth estimation task: one for scale invariance ($-\frac{1}{2N^2} (\sum_i y_i)^2$), and another for similarity in local structure ($\frac{1}{N} \sum_i [(\nabla_x y_i)^2 + (\nabla_y y_i)^2]$, with $\nabla_x$ and $\nabla_y$ denoting the horizontal and vertical image gradients). Instead, the depth estimation branch uses the reverse Huber loss (berHu) [17],

$$B(y) = \begin{cases} |y| & \text{if } |y| \le c, \\ \dfrac{y^2 + c^2}{2c} & \text{if } |y| > c, \end{cases}$$

which strikes a good balance between penalizing high residuals, which usually account for the mean depth, and low residuals, which explain the smaller depth details. We have experimentally found that this choice yields a better final error than using the $L_2$ loss, even with the added terms. Note that the reverse Huber loss formulation above is continuous and first-order differentiable at the point $c$, which is set to $c = \frac{1}{5} \max_i |y_i|$ as in [15]. We use the SGM-calculated disparity depth maps of the Cityscapes dataset as ground truth for this task.
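A minimal sketch of the reverse Huber loss over a list of per-pixel residuals (prediction minus ground truth) is given below. The function names are hypothetical; the default threshold of one fifth of the largest absolute residual is an assumption about the setting attributed to [15], and averaging over pixels is an illustrative choice.

```python
from typing import Optional, Sequence


def berhu_point(y: float, c: float) -> float:
    """Reverse Huber penalty of one residual: linear below c, quadratic above."""
    a = abs(y)
    if a <= c:
        return a
    return (a * a + c * c) / (2.0 * c)


def berhu_loss(residuals: Sequence[float], c: Optional[float] = None) -> float:
    """Mean reverse Huber penalty over all residuals.

    If c is not given it defaults to 0.2 * max|residual| (assumed setting).
    """
    if c is None:
        c = 0.2 * max(abs(r) for r in residuals)
    if c == 0.0:
        return 0.0  # all residuals are zero, so the loss is zero as well
    return sum(berhu_point(r, c) for r in residuals) / len(residuals)
```

At |y| = c both cases evaluate to c and both have slope 1, which is why the two pieces join smoothly; above c the penalty grows quadratically, so large residuals are weighted more heavily than under a plain absolute error.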

Training To train our multi-task network, the three losses described above are summed with equal weights. Although different weights could be used for each task, we found that equal weights already lead to good performance. We start from a pretrained encoder, trained for Cityscapes segmentation, and continue training the three tasks together. We train with a batch size of 10 at a resolution of 1024x512 and use Adam with a learning rate of 5e-4. Note that we keep the parameters of the batch norm layers fixed.

IV Results

Semantic and instance segmentation

IoU class IoU category
Segnet [2]
ENet [18]
SQ [21]
TABLE I: Semantic segmentation results on the Cityscapes benchmark.
AP AP0.5 AP100m AP50m
InstanceCut [12]
Pixelwise DIN [1]
DWT [3]
Shape-aware [10]
Mask R-CNN [11]
TABLE II: Instance segmentation results on the Cityscapes benchmark.

We report Cityscapes semantic segmentation results in Tab. I and instance segmentation results on the car class in Tab. II. We notice that by jointly training our network for three different tasks, we match and even slightly outperform standard ENet for semantic scene segmentation. This supports our hypothesis that training with multiple tasks at once can increase the performance of each individual task.

As expected, our result for instance segmentation lags behind the other methods on the Cityscapes benchmark, since they are all optimized for accuracy and are far from real-time. They either rely on a big network or use highly accurate pre-generated semantic segmentation labels, which explains their significantly higher performance compared to our result. Nevertheless, this work can serve as a baseline for methods that also focus on speed.

Depth In Fig. 2 we plot for each car in the dataset its ground truth depth versus its predicted depth, which is calculated as an average over the predicted depth map masked out with the ground truth instance mask. The expected trend of nearby cars being predicted more accurately than far-away cars is clearly visible. Some of the extreme outliers are caused by cars that are mostly occluded and thus only consist of a few pixels. These extreme cases can in principle be detected and filtered out using the instance mask. We encourage others to include similar plots in their work on car depth estimation, as it is more informative than a single summary number.

Nevertheless, we follow [22] and report three metrics in Tab. III: mean absolute error (MAE), root mean squared error (RMSE) and absolute relative error (ARD). Note that we calculate the depth of each car by average pooling the predicted depth map with the ground truth instance masks. This is unlike [22], who calculate the depth with the predicted instance masks, and report the metrics only over predicted cars that match with ground truth cars. This means that the metric they report does not take the depth estimation of undetected smaller or badly visible cars into account, leading to a number that is dependent on the instance segmentation performance. By reporting the numbers over the ground truth car masks we avoid this entanglement, but some caution is necessary when comparing the numbers. We provide the numbers at different maximum depths of 100m, 50m and 25m.
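The evaluation protocol described above can be sketched as follows: the depth of a car is the average of the depth map over its instance mask, and the three metrics are then computed over all cars up to a maximum depth. The function names are hypothetical, and applying the depth cutoff to the ground truth depth of each car is an assumption made for this illustration.

```python
import math
from typing import Sequence, Tuple


def car_depth(depth_map: Sequence[Sequence[float]],
              mask: Sequence[Sequence[int]]) -> float:
    """Average pooling of a depth map over the pixels of one instance mask."""
    values = [d for row_d, row_m in zip(depth_map, mask)
              for d, m in zip(row_d, row_m) if m]
    return sum(values) / len(values)


def depth_metrics(pred: Sequence[float], gt: Sequence[float],
                  max_depth: float) -> Tuple[float, float, float]:
    """Return (MAE in m, RMSE in m, ARD in %) over cars with gt depth <= max_depth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g <= max_depth]
    n = len(pairs)
    mae = sum(abs(p - g) for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    ard = 100.0 * sum(abs(p - g) / g for p, g in pairs) / n
    return mae, rmse, ard
```

Calling `depth_metrics` with cutoffs of 100, 50 and 25 on the same per-car depths gives the three evaluation settings; the per-car depths themselves come from applying `car_depth` to the predicted and ground truth depth maps with the same ground truth mask.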

Fig. 2: Left: ground truth and predicted depth for each car in the dataset. The expected trend of nearby cars being predicted more accurately than far-away cars is clearly visible. Right: a qualitative result (top: prediction, bottom: ground truth).
Fig. 3: Another qualitative result on the validation set. Top: predicted and ground truth summary picture. Bottom: predicted semantic, instance and depth maps.
MAE RMSE ARD
Uhrig et al. [22] (val) m m %
Ours (val, 100m) m m %
Ours (val, 50m) m m %
Ours (val, 25m) m m %
TABLE III: Single image depth results on the Cityscapes validation set. Some caution is necessary when comparing with [22] (see Sec. IV).

Multi-task network and speed In Tab. IV we compare training the tasks separately (each running on an ENet of its own) with training them together with a shared encoder, as explained in the previous section. The benefits of training the three tasks together in a single multi-task network are clear: the speed almost doubles and the memory usage decreases drastically. This makes our approach suitable for real-time autonomous driving applications that require a low memory footprint. Importantly, the accuracy of the individual tasks does not decrease when training together; in fact, we even notice a slight performance increase. This suggests that the shared encoder can effectively learn to exploit the common structure of the three related tasks.

semantic instance depth mem speed
Trained separately % % m GB fps
Trained together % % m GB fps
TABLE IV: Semantic segmentation (IoUc), instance segmentation (AP), depth (MAEm on the val set), memory, and speed (forward pass on a Pascal TitanX) at test time when trained separately versus together.

V Conclusion

Overall, our system is fast but lags behind the state of the art in terms of segmentation accuracy. Nevertheless, we believe that it can serve as a low-complexity baseline for other multi-task approaches that focus on speed, and as a starting point for further exploration of the speed-accuracy trade-off in scene understanding. Furthermore, we observe that jointly training tasks can potentially lead to increased performance.

Acknowledgement: The work was supported by Toyota, and was carried out at the TRACE Lab at KU Leuven (Toyota Research on Automated Cars in Europe - Leuven).