
Real-Time Joint Semantic Segmentation and Depth Estimation Using Asymmetric Annotations

by Vladimir Nekrasov, et al.

Deploying deep learning models in robotics as sensory information extractors can be a daunting task, even when using generic GPU cards. Here, we address three of its most prominent hurdles, namely, i) the adaptation of a single model to perform multiple tasks at once (in this work, we consider depth estimation and semantic segmentation, which are crucial for acquiring geometric and semantic understanding of the scene), while ii) doing it in real-time, and iii) using asymmetric datasets with uneven numbers of annotations per modality. To overcome the first two issues, we adapt a recently proposed real-time semantic segmentation network, making few changes to further reduce the number of floating point operations. To approach the third issue, we embrace a simple solution based on hard knowledge distillation under the assumption of having access to a powerful 'teacher' network. Finally, we showcase how our system can be easily extended to handle more tasks, and more datasets, all at once. Quantitatively, we achieve 42% mean IoU, 0.565 m RMSE (lin) and 0.205 RMSE (log) with a single model on NYUDv2-40, and 87% mean IoU on KITTI-6 for segmentation and 3.453 m RMSE (lin) on KITTI for depth estimation, with one forward pass costing just 17ms and 6.45 GFLOPs on 1200x350 inputs. All these results are either equivalent to (or better than) current state-of-the-art approaches, which were achieved with larger and slower models solving each task separately.



I Introduction

As the number of tasks on which deep learning shows impressive results continues to grow in range and diversity, the number of models that achieve such results keeps increasing accordingly, making it harder for practitioners to deploy a complex system that needs to perform multiple tasks at once. For some closely related tasks, such a deployment does not present a significant obstacle: besides their structural similarity, those tasks tend to share the same datasets, as is the case, for example, for image classification, object detection, and semantic segmentation. On the other hand, tasks like segmentation and depth estimation rarely (fully) share the same dataset; for example, the NYUD dataset [1, 2] comprises a large set of annotations for depth estimation, but only a small labelled set of segmentations. One can readily approach this problem by simply updating the parameters of each task only if there exist ground truth annotations for that task. Unfortunately, this often leads to suboptimal results due to imbalanced and biased gradient updates. We note that while it is not clear how to handle such a scenario in the most general case, in this paper we assume that we have access to a large and powerful model that can make informative predictions in place of the missing labels. For each single task considered separately, this assumption is often valid, and we make use of it to predict missing segmentation masks.

Another issue that arises is the imperative in the context of robotics and autonomous systems for extraction of sensory information in real time. While there has been a multitude of successful approaches to speed up individual tasks [3, 4, 5], there is barely any prior work on performing multiple tasks concurrently in real-time. Here we show how to perform two tasks, depth estimation and semantic segmentation, in real-time with very few architectural changes and without any complicated pipelines.

Our choice of tasks is motivated by the observation that, for all sorts of robotic applications, it is important for a robot (an agent) to know the semantics of its surroundings and to perceive the distances to the surfaces in the scene. The proposed methodology is simple and achieves competitive results in comparison to large models. Furthermore, we believe that nothing prevents practitioners and researchers from adapting our method to more tasks, which, in turn, would lead to better exploitation of deep learning models in real-world applications. To confirm this claim, we conduct additional experiments, predicting surface normals besides depth and segmentation. Moreover, we successfully train a single model able to perform depth estimation and semantic segmentation, all at once, in both indoor and outdoor settings. In yet another case study, we demonstrate that raw outputs of our joint network (segmentation and depth) can be directly used inside the SemanticFusion framework [6] to estimate a dense semantic 3D reconstruction of the scene.

To conclude our introduction, we re-emphasise that our results demonstrate that there is no need to uncritically deploy multiple expensive models, when the same performance can be achieved with one small network - a case of one being better than two!

Fig. 1: General network structure for joint semantic segmentation and depth estimation. Each task has only two specific parametric layers, while everything else is shared

II Related Work

Our work is closely related to several topics. Among them are multi-task learning, semantic segmentation, depth estimation, and knowledge distillation.

According to the classical multi-task learning paradigm, forcing a single model to perform several related tasks simultaneously can improve generalisation via imposing an inductive bias on the learned representations [7, 8]. Such an approach assumes that all the tasks use a shared representation before learning task-specific parameters. Multiple works in computer vision have been following this strategy; in particular, Eigen & Fergus [9] trained a single architecture (but with different copies) to predict depth, surface normals and semantic segmentation, Kokkinos [10] proposed a universal network to tackle different vision tasks, Dvornik et al. [11] found it beneficial to do joint semantic segmentation and object detection, while Kendall et al. [12] learned optimal weights to perform instance segmentation, semantic segmentation and depth estimation all at once. To alleviate the problem of imbalanced annotations, Kokkinos [10] chose to accumulate the gradients for each task until a certain number of examples per task is seen, while Dvornik et al. [11] simply resorted to keeping the branch with no ground truth available intact until at least one example of that modality is seen. We note that none of these methods make any use of already existing models for each separate task, and none of them, with the exception of BlitzNet [11], achieve real-time performance. In contrast, we show how to exploit large pre-trained models to acquire better results, and how to do inference in real-time.

Semantic segmentation is a task of per-pixel label classification, and most approaches in recent years have been centered around the idea of adapting image classification networks into fully convolutional ones able to operate on inputs of different sizes [13, 14, 15]. Real-time usage of such networks with decent performance is a non-trivial problem, and few approaches are currently available [5, 16, 17, 18]. We have chosen the recently proposed Light-Weight RefineNet [18] on top of MobileNet-v2 [19] as our baseline architecture, as it exhibits solid performance on the standard benchmark dataset, PASCAL VOC [20], in real-time, while having fewer than M parameters.

Depth estimation is another per-pixel task, the goal of which is to determine how far each pixel is from the observer. Traditionally, image based depth reconstruction was performed using SLAM based approaches [21, 22, 23]. However, recent machine learning approaches have achieved impressive results, where a CNN has been successfully employed to predict a depth map from a single RGB image using supervised learning [9, 24, 25, 26], unsupervised learning [27, 28] and semi-supervised learning [29]. Predicting multiple quantities including depths from a single image was first tackled by Eigen & Fergus [9]. Dharmasiri et al. [30] demonstrated that predicting related structural information in the form of depths, surface normals and surface curvature results in improved performances of all three tasks compared to utilising three separate networks. Most recently, Qi et al. [31] found it beneficial to directly encode a geometrical structure as part of the network architecture in order to perform depth estimation and surface normals estimation simultaneously. Our approach is fundamentally different to these previous works in two ways. Firstly, our network exhibits real-time performance on each individual task. Secondly, we demonstrate how to effectively incorporate asymmetric and uneven ground truth annotations into the training regime. Furthermore, it should be noted that despite using a smaller model running in real-time, we still quantitatively outperform these approaches.

Finally, we briefly touch upon the knowledge distillation approach [32, 33, 34, 35] that is based on the idea of having a large pre-trained teacher (expert) network (or an ensemble of networks), and using its logits, or predictions directly, as a guiding signal for a small network along with original labels. Several previous works relied on knowledge distillation to either acquire missing data [36], or as a regulariser term [37, 38]. While those are relevant to our work, we differ along several axes: most notably, Zamir et al. [36] require separate network copies for different tasks, while Hoffman et al. [37] and Li & Hoiem [38] only consider single-task learning (object detection and image classification, respectively).

III Methodology

While we primarily discuss the case with only two tasks present, the same machinery applies for more tasks, as demonstrated in Sect. V-A.

III-A Backbone Network

As mentioned in the previous section, we employ the recently proposed Light-Weight RefineNet architecture [18] built on top of the MobileNet-v2 classification network [19]. This architecture extends the classifier by appending several simple contextual blocks, called Chained Residual Pooling (CRP) [39], consisting of a series of max-pooling and convolutions (Fig. 1).

Even though the original structure already achieves real-time performance and has a small number of parameters, for the joint task of depth estimation and semantic segmentation (of classes) it requires more than  GFLOPs on inputs of size , which may hinder its direct deployment on mobile platforms with few resources available. We found that the last CRP block is responsible for more than half of the FLOPs, as it deals with the high-resolution feature maps ( from the original resolution). Thus, to decrease its influence, we replace the convolution in the last CRP block with its depthwise equivalent (i.e., a grouped convolution with the number of groups being equal to the number of input channels). By doing that, we reduce the number of operations by more than half, down to just 6.49 GFLOPs (Table I).
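To make the saving from a depthwise replacement concrete, the following minimal sketch counts multiply-accumulate operations for a plain and a grouped convolution. The function name `conv_macs` and the layer shape (feature-map size, channel count, kernel size) are assumptions chosen purely for illustration; they are not the values used in the paper.

```python
def conv_macs(height, width, in_channels, out_channels, kernel_size, groups=1):
    """Multiply-accumulate operations of one 2-D convolution.

    Assumes stride 1, 'same' padding and no bias, so the output keeps the
    spatial size height x width.
    """
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Every output value sums kernel_size^2 * (in_channels / groups) products.
    return (height * width * out_channels
            * (in_channels // groups) * kernel_size ** 2)


# Hypothetical layer shape, chosen only for illustration.
H, W, C, K = 120, 160, 256, 3

plain = conv_macs(H, W, C, C, K)
# Depthwise: one group per input channel, each filter sees a single channel.
depthwise = conv_macs(H, W, C, C, K, groups=C)

print(f"plain:     {plain / 1e9:.3f} GMACs")
print(f"depthwise: {depthwise / 1e9:.5f} GMACs")
print(f"reduction: {plain // depthwise}x")
```

For a single layer the reduction factor equals the number of input channels. The saving for the whole network is smaller, since only one layer is changed.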

III-B Joint Semantic Segmentation and Depth Estimation

In the general case, it is non-trivial to decide where to branch out the backbone network into separate task-specific paths in order to achieve the optimal performance on all of them simultaneously. While placing no strong assumptions, instead of finding the optimal spot, we simply branch out right after the last CRP block, and append two additional convolutional layers (one depthwise and one plain) for each task (Fig. 1).

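If we denote the output of the network before the branching as $\tilde{y} = g_{\theta_b}(I)$, where $g_{\theta_b}$ is the backbone network with a set of parameters $\theta_b$, and $I$ is the input RGB-image, then the depth and segmentation predictions can be denoted as $\tilde{y}_d = g_{\theta_d}(\tilde{y})$ and $\tilde{y}_s = g_{\theta_s}(\tilde{y})$, where $g_{\theta_s}$ and $g_{\theta_d}$ are segmentation and depth estimation branches with the sets of parameters $\theta_s$ and $\theta_d$, respectively. We use the standard softmax cross-entropy loss for segmentation and the inverse Huber loss for depth estimation [25]. Our total loss (Eqn. (1)) contains an additional scaling parameter, $\lambda$, which, for simplicity, we set to a fixed value:

$$
\begin{aligned}
L_{segm}(I, G) &= -\frac{1}{|I|} \sum_{i \in I} \log \operatorname{softmax}(\tilde{y}_s)_{i G_i}, \\
L_{depth}(I, D) &= \begin{cases} |D - \tilde{y}_d|, & \text{if } |D - \tilde{y}_d| \le c, \\ \dfrac{(D - \tilde{y}_d)^2 + c^2}{2c}, & \text{otherwise}, \end{cases} \\
L_{total}(I, G, D) &= \lambda L_{segm}(I, G) + (1 - \lambda) L_{depth}(I, D),
\end{aligned}
\tag{1}
$$

where $G$ and $D$ denote ground truth segmentation mask and depth map, correspondingly; $(\cdot)_{ij}$ in the segmentation loss is the probability value of class $j$ at pixel $i$; and $c$ is the threshold of the inverse Huber loss [25].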
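The combination of the two losses can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the weighting form `lam * segm + (1 - lam) * depth`, the default `lam=0.5`, the inverse Huber threshold of 0.2 times the largest absolute error, and the `ignore_label` value of 255 are all assumptions, and every function name is hypothetical. A practical version would be vectorised over tensors.

```python
import math


def softmax_cross_entropy(logits, labels, ignore_label=255):
    """Mean negative log-probability of the true class over labelled pixels.

    logits: list of per-pixel class-score lists; labels: list of ints.
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_label:  # pixel without annotation
            continue
        peak = max(scores)  # subtract the maximum for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[label]
        count += 1
    return total / count if count else 0.0


def inverse_huber(pred, target, ratio=0.2):
    """Inverse Huber (berHu) loss: L1 below the threshold c, scaled L2 above."""
    errors = [abs(p - t) for p, t in zip(pred, target)]
    if not errors:
        return 0.0
    c = ratio * max(errors)  # assumed threshold rule
    if c == 0.0:
        return 0.0
    per_pixel = [e if e <= c else (e * e + c * c) / (2 * c) for e in errors]
    return sum(per_pixel) / len(per_pixel)


def total_loss(seg_logits, seg_labels, depth_pred, depth_gt, lam=0.5):
    """Weighted sum of the segmentation and depth losses."""
    return (lam * softmax_cross_entropy(seg_logits, seg_labels)
            + (1.0 - lam) * inverse_huber(depth_pred, depth_gt))
```

Because both branches contribute to one scalar, a single backward pass updates the shared backbone with gradients from both tasks.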

Sem. Segm. Depth Estimation General
Model Regime mIoU,% RMSE (lin),m RMSE (log) Parameters,M GFLOPs speed,ms (mean/std)
Ours Segm,Depth 3.07 6.49 12.8±0.1
RefineNet-101 [39] Segm 43.6
RefineNet-LW-50 [18] Segm
Context [40] Segm
Kendall and Gal [41] Segm,Depth 0.506
Fast Res.Forests [42] Segm
Eigen and Fergus [9] Segm,Depth
Laina et al. [25] Depth 0.195
Qi et al. [31] Depth,Normals - -
TABLE I: Results on the test set of NYUDv2. The speed of a single forward pass and the number of FLOPs are measured on inputs. For the reported mIoU the higher the better, whereas for the reported RMSE the lower the better. () means that both tasks are performed simultaneously using a single model, while () denotes that two tasks employ the same architecture but use different copies of weights per task

III-C Expert Labeling for Asymmetric Annotations

As one would expect, it is impossible to have all the ground truth sensory information available for each single image. Quite naturally, this poses a question of how to deal with a set of images $S$ among which some have an annotation of one modality, but not another. Assuming that one modality is always present for each image, this then divides the set $S$ into two disjoint sets $S_1 = S_{T_1}$ and $S_2 = S_{T_1, T_2}$ such that $S = S_1 \cup S_2$, where $T_1$ and $T_2$ denote two tasks, respectively, and the set $S_1$ consists of images for which there are no annotations of the second task available, while $S_2$ comprises images having both sets of annotations.

Plainly speaking, there is nothing that prohibits us from still exploiting equation (1), in which case only the weights of the branch with available labels will be updated. As we show in our experiments, this leads to biased gradients and, consequently, sub-optimal solutions. Instead, emphasising the need of updating both branches simultaneously, we rely on an expert model to provide us with noisy estimates in place of missing annotations.

More formally, if we denote the expert model on the second task as $f_{T_2}$, then its predictions $f_{T_2}(S_1)$ on the set $S_1$ can be used as synthetic ground truth data, which we will use to pre-train our joint model before the final fine-tuning on the original set $S_2$ with readily available ground truth data for both tasks. Here, we exploit the labels predicted by the expert network instead of logits, as storing a set of large 3-D floating point tensors requires extensive resources.

Note also that our framework is directly transferable to cases when the set $S$ comprises several datasets. In Sect. V-B we showcase a way of exploiting all of them at the same time using a single copy of the model.
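The expert-labelling step can be sketched as below: images that lack a segmentation mask receive the teacher's hard (argmax) labels, and a small helper compares the storage cost of hard labels against full logits. All names (`hard_labels`, `fill_missing`, `storage_bytes`), the sample layout, and the 640x480 resolution used in the usage note are assumptions for illustration only.

```python
def hard_labels(teacher_scores):
    """Per-pixel argmax over class scores: list[list[float]] -> list[int]."""
    return [max(range(len(scores)), key=scores.__getitem__)
            for scores in teacher_scores]


def fill_missing(samples, teacher):
    """Replace missing segmentation masks with the teacher's hard labels.

    samples: list of dicts with keys 'image', 'depth' and 'segm', where
             'segm' is None when no annotation exists.
    teacher: callable mapping an image to per-pixel class scores.
    """
    filled = []
    for sample in samples:
        segm = sample["segm"]
        synthetic = segm is None
        if synthetic:
            segm = hard_labels(teacher(sample["image"]))
        # Keep a flag so that fine-tuning can later select real labels only.
        filled.append({**sample, "segm": segm, "synthetic": synthetic})
    return filled


def storage_bytes(height, width, num_classes):
    """Bytes per image for hard labels (uint8) versus float32 logits."""
    labels = height * width                    # one byte per pixel
    logits = height * width * num_classes * 4  # four bytes per class score
    return labels, logits
```

For example, `storage_bytes(480, 640, 40)` gives 307,200 bytes for labels against 49,152,000 bytes for logits, a factor of 160, which illustrates why storing labels is the cheaper choice.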

IV Experimental Results

In our experiments, we consider two datasets, NYUDv2-40 [1, 2] and KITTI [43, 44], representing indoor and outdoor settings, respectively, and both being used extensively in the robotics community.

All the training experiments follow the same protocol. In particular, we initialise the classifier part using the weights pre-trained on ImageNet [45], and train using mini-batch SGD with momentum with the initial learning rate of - and the momentum value of . Following the setup of Light-Weight RefineNet [18], we keep batch norm statistics frozen. We divide the learning rate by after pre-training on a large set with synthetic annotations. We train with a random square crop of augmented with random mirroring.

All our networks are implemented in PyTorch [46]. To measure the speed performance, we compute forward passes and report both the mean and standard deviation values, as done in [18]. Our workstation has GB RAM, an Intel i5-7600 processor and a single GTX 1080Ti GPU card running CUDA 9.0 and cuDNN 7.0.
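A timing protocol of this kind (repeated forward passes, mean and standard deviation) can be sketched with the standard library alone. The helper name `time_forward`, the run count and the warm-up count are assumptions; the number of passes used in the paper is not recoverable from the text above.

```python
import statistics
import time


def time_forward(forward, n_runs=100, n_warmup=10):
    """Return (mean, std) of the wall-clock time of `forward()` in ms.

    `forward` is any zero-argument callable. For GPU inference it must
    block until the device has finished, for example by calling
    torch.cuda.synchronize() before returning; otherwise only the kernel
    launch time is measured.
    """
    if n_runs < 2:
        raise ValueError("need at least two runs for a standard deviation")
    for _ in range(n_warmup):  # warm-up runs are not recorded
        forward()
    timings_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        forward()
        timings_ms.append((time.perf_counter() - start) * 1e3)
    return statistics.mean(timings_ms), statistics.stdev(timings_ms)
```

Calling `time_forward(lambda: model(batch))` with hypothetical `model` and `batch` objects would then yield a pair comparable to the "mean/std" speed columns in the tables.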

IV-A NYUDv2

NYUDv2 is an indoor dataset with 40 semantic labels. It contains RGB images with both segmentation and depth annotations, of which comprise the training set and - validation. The raw dataset contains more than training images with depth annotations. During training we use less than (K images) of this data. As discussed in Sect. III-C, we annotate these images for semantic segmentation using a teacher network (here, we take the pre-trained Light-Weight RefineNet-152 [18] that achieves mean IoU on the validation set). After acquiring the synthetic annotations, we pre-train the network on the large set, and then fine-tune it on the original small set of images.

Quantitatively, we are able to achieve 42.02% mean IoU and 0.565 m RMSE (lin) on the validation set (Tables I and III), outperforming several large models, while performing both tasks in real-time simultaneously. More detailed results for depth estimation are given in Table II, and qualitative results are provided in Fig. 2.
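For reference, mean IoU over flattened label maps can be computed as in the sketch below. The function name, the `ignore_label` value of 255, and the convention of skipping classes that occur in neither prediction nor ground truth are assumptions; evaluation scripts differ in these details.

```python
def mean_iou(pred, gt, num_classes, ignore_label=255):
    """Mean intersection-over-union over flattened label lists.

    Assumes every prediction lies in [0, num_classes). Pixels whose ground
    truth equals `ignore_label` are skipped.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, g in zip(pred, gt):
        if g == ignore_label:
            continue
        if p == g:
            intersection[g] += 1
            union[g] += 1
        else:
            # A wrong pixel enlarges the union of both classes involved.
            union[g] += 1
            union[p] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    if not ious:
        raise ValueError("no labelled pixels to evaluate")
    return sum(ious) / len(ious)
```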

Metric Ours Laina et al. [25] Kendall and Gal [41] Qi et al. [31]
RMSE (lin) 0.565 0.573 0.506 0.569
RMSE (log) 0.205 0.195
abs rel 0.149 0.127 0.11
sqr rel 0.105 0.128
δ < 1.25 0.790 0.811 0.817 0.834
δ < 1.25² 0.955 0.953 0.959 0.960
δ < 1.25³ 0.990 0.988 0.989 0.990
TABLE II: Detailed results on the test set of NYUDv2 for the depth estimation task. For the reported RMSE, abs rel and sqr rel the lower the better, whereas for accuracies (δ) the higher the better
Fig. 2: Qualitative results on the test set of NYUD-v2. The black and dark-blue pixels in ‘GT-Segm’ and ‘GT-Depth’ respectively, indicate pixels without an annotation or label
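The depth metrics reported in Table II can be computed as in the following sketch. It follows the commonly used definitions, which is an assumption here since the text does not spell them out: natural logarithm for RMSE (log), thresholds of 1.25, 1.25² and 1.25³ for the accuracies, and pixels with non-positive ground truth treated as missing. The function name and the returned keys are hypothetical.

```python
import math


def depth_metrics(pred, gt):
    """Standard monocular depth metrics over flattened depth lists.

    Pixels with gt <= 0 are treated as missing and skipped. Predictions
    are assumed to be strictly positive.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    n = len(pairs)
    if n == 0:
        raise ValueError("no valid ground truth pixels")
    rmse_lin = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sqr_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    # Accuracy: share of pixels whose ratio max(p/g, g/p) is below 1.25^k.
    ratios = [max(p / g, g / p) for p, g in pairs]
    deltas = [sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)]
    return {
        "rmse_lin": rmse_lin,
        "rmse_log": rmse_log,
        "abs_rel": abs_rel,
        "sqr_rel": sqr_rel,
        "delta1": deltas[0],
        "delta2": deltas[1],
        "delta3": deltas[2],
    }
```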

Ablation Studies. To evaluate the importance of pre-training using the synthetic annotations and benefits of performing two tasks jointly, we conduct a series of ablation experiments. In particular, we compare three baseline models trained on the small set of images and three other approaches that make use of additional data - ours with noisy estimates from a larger model, and two methods, one by Kokkinos [10], where the gradients are being accumulated until a certain number of examples is seen, and one by Dvornik et al. [11], where the task branch is updated every time at least one example is seen.

The results of our experiments are given in Table III. The first observation that we make is that performing two tasks jointly on the small set does not provide any significant benefits for each separate task, and even substantially harms semantic segmentation. In contrast, having a large set of depth annotations results in valuable improvements in depth estimation and even semantic segmentation, when it is coupled with a clever strategy of accumulating gradients. Nevertheless, none of the methods can achieve competitive results on semantic segmentation, whereas our proposed approach reaches better performance without any changes to the underlying optimisation algorithm.

Annotations Update Frequency Sem. Segm. Depth
Method Pre-Training Fine-Tuning Task Base mIoU,% RMSE (lin),m
Baseline (SD) SD
Baseline (S) S
Baseline (D) D
BlitzNet [11] D + SD SD
UberNet [10] D + SD SD
Ours SD SD 42.02 0.5648
TABLE III: Results of ablation experiments on the test set of NYUDv2. SD means how many images have a joint pair of annotations - both segmentation (S) and depth (D); task update frequency denotes the number of examples of each task to be seen before performing a gradient step on task-specific parameters; base update frequency is the number of examples to be seen (regardless of the task) before performing a gradient step on shared parameters
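The two update frequencies defined in the caption of Table III can be illustrated with a small bookkeeping simulation. This sketch only tracks when gradient steps would fire; it is not the implementation of [10] or [11], and the function name, the event format and the example stream are hypothetical.

```python
def update_schedule(stream, task_freq, base_freq):
    """Simulate when task-specific and shared parameters take a step.

    stream:    iterable of sets naming the tasks annotated in each example.
    task_freq: dict task -> number of examples of that task to accumulate
               before a step on its task-specific parameters.
    base_freq: number of examples (of any task) to accumulate before a
               step on the shared parameters.
    Returns a list of (example_index, name) events, with name 'base' for
    the shared parameters.
    """
    task_seen = {task: 0 for task in task_freq}
    base_seen = 0
    events = []
    for index, tasks in enumerate(stream):
        base_seen += 1
        for task in sorted(tasks):  # sorted for a deterministic event order
            task_seen[task] += 1
            if task_seen[task] == task_freq[task]:
                events.append((index, task))
                task_seen[task] = 0
        if base_seen == base_freq:
            events.append((index, "base"))
            base_seen = 0
    return events
```

With depth annotations on every example and segmentation on only some, a low segmentation frequency makes the segmentation branch step whenever a label appears, whereas a higher one accumulates gradients over several labelled examples first.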
Sem. Segm. Depth Estimation General
Model Regime mIoU,% RMSE (lin),m RMSE (log) Parameters,M Input Size GFLOPs speed,ms (mean/std)
Ours Segm,Depth 87.02 3.453 2.99 1200x350 6.45 16.9±0.1
Fast Res.Forests [42] Segm 1200x350
Wang et al. [47] Segm
Garg [27] Depth 5.104 0.273
Goddard [28] Depth 4.471 0.232 512x256
Kuznietsov [29] Depth 3.518 0.179 621x187
TABLE IV: Results on the test set of KITTI-6 for segmentation and KITTI for depth estimation

IV-B KITTI

KITTI is an outdoor dataset that contains images semantically annotated for training (with semantic classes), and images for testing [44]. Following previous work by Wang et al. [47], we keep only 6 well-represented classes.

Besides segmentation, we follow [24] and employ images with depth annotations available for training [43], and images for testing. Due to similarities with the CityScapes dataset [48], we consider ResNet- [14] trained on CityScapes as our teacher network to annotate the training images with given depth but not semantic segmentation. In turn, to annotate missing depth annotations on images with semantic labels, we first trained a separate copy of our network on the depth task only, and then used it as a teacher. Note that we abandoned this copy of the network and did not make any further use of it.

After pre-training on the large set, we fine-tune the model on the small set of examples. Our quantitative results are provided in Table IV, while visual results can be seen in Fig. 3. Per-class segmentation results are given in Table V.

Model sky building road sidewalk vegetation car Total
Ours 87.7 92.8 82.7 87.6 87.0
Fast Res.Forests [42] 87.8
Wang et al. [47] 88.6
TABLE V: Detailed segmentation results on the test set of KITTI-6
Fig. 3: Qualitative results on the test set of KITTI (for which only GT depth maps are available). We do not visualise GT depth maps due to their sparsity

V Extensions

The goal of this section is to demonstrate the ease with which our approach can be directly applied in other practical scenarios, such as, for example, the deployment of a single model performing three tasks at once, and the deployment of a single model performing two tasks at once under two different scenarios - indoor and outdoor. As the third task, here we consider surface normals estimation, and as two scenarios, we consider training a single model on both NYUD and KITTI simultaneously without the necessity of having a separate copy of the same architecture for each dataset.

In this section, we strive for simplicity and do not aim to achieve high performance numbers, thus we directly apply the same training scheme as outlined in the previous section.

V-A Single Model - Three Tasks

Fig. 4: Qualitative results on the test set of NYUD-v2 for three tasks. The black pixels in the ‘GT-Segm’ images indicate those without a semantic label, whereas the dark blue pixels in the ‘GT-Depth’ images indicate missing depth values

Analogously to the depth and segmentation branches, we append the same structure with two convolutional layers for surface normals. We employ the negative dot product (after normalisation) as the training loss for surface normals, and we multiply the learning rate for the normals parameters by , as done in [9].
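The negative dot product loss for surface normals can be sketched as follows. The function name and the `eps` guard against zero-length vectors are assumptions; a practical version would operate on tensors and mask out pixels without ground truth.

```python
import math


def normals_loss(pred, target, eps=1e-12):
    """Mean negative dot product between unit-normalised 3-vectors.

    pred, target: lists of [x, y, z] vectors, one per pixel.
    Returns -1 for perfectly aligned normals and +1 for opposite ones.
    """
    if not pred:
        raise ValueError("no pixels given")

    def unit(vector):
        norm = math.sqrt(sum(c * c for c in vector))
        return [c / max(norm, eps) for c in vector]

    total = 0.0
    for p, t in zip(pred, target):
        total -= sum(a * b for a, b in zip(unit(p), unit(t)))
    return total / len(pred)
```

Normalising both vectors first makes the loss depend only on the angle between prediction and ground truth, not on the magnitude of the raw network output.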

We exploit the raw training set of NYUDv2 [1] with more than images, having (noisy) depth maps from the Kinect sensor and with surface normals computed using the toolbox provided by the authors. To acquire missing segmentation labels, we repeat the same procedure outlined in the main experiments - in particular, we use the Light-Weight RefineNet-152 network [18] to get noisy labels. After pre-training on this large dataset, we divide the learning rate by and fine-tune the model on the small dataset of images having annotations for each modality. For surface normals, we employ the annotations provided by Silberman et al. [1].

Our straightforward approach achieves practically the same numbers on depth estimation, but suffers a significant performance drop on semantic segmentation (Table VI). This might be directly caused by the excessive number of imperfect labels, on which the semantic segmentation part is being pre-trained. Nevertheless, the results on all three tasks remain competitive, and we are able to perform all three of them in real-time simultaneously. We provide a few examples of our approach in Fig. 4.

Sem. Segm. Depth Est. Surface Normals Est. General
mIoU,% RMSE (lin),m RMSE (log) Mean Angle Dist. Median Angle Dist. speed,ms (mean/std)
42.02 0.565 0.205 12.8±0.1
TABLE VI: Results on the test set of NYUDv2 of our single network predicting three modalities at once with surface normals annotations from [1]. The speed of a single forward pass is measured on inputs. Baseline results (with a single network performing only segmentation and depth) are in bold

V-B Single Model - Two Datasets, Two Tasks

Next, we consider the case when it is undesirable to have a separate copy of the same model architecture for each dataset. Concretely, our goal is to train a single model that is able to perform semantic segmentation and depth estimation on both NYUD and KITTI at once. To this end, we simply concatenate both datasets and amend the segmentation branch to predict 46 labels (40 from NYUD and 6 from KITTI-6).

We follow the exact same training strategy, and after pre-training on the union of large sets, we fine-tune the model on the union of small training sets. Our network exhibits no difficulties in differentiating between two regimes (Table VII), and achieves results at the same level with the separate approach on each of the datasets without a substantial increase in model capacity.
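One simple way to realise a concatenated label space is to shift the labels of each dataset by a fixed offset, as sketched below. This is an assumption about how such a merge could be done rather than a description of the authors' code: the class counts (40 for NYUDv2-40, 6 for KITTI-6), the ordering of the datasets, the `ignore_label` value and all names are illustrative.

```python
def build_offsets(class_counts):
    """class_counts: ordered list of (dataset_name, num_classes) pairs.

    Returns (offsets, total), where offsets[name] is the index of the
    first joint label belonging to that dataset.
    """
    offsets, total = {}, 0
    for name, num_classes in class_counts:
        offsets[name] = total
        total += num_classes
    return offsets, total


def to_joint(label, dataset, offsets, ignore_label=255):
    """Map a dataset-local label to the joint label space."""
    if label == ignore_label:  # unlabelled pixels stay unlabelled
        return label
    return label + offsets[dataset]


# Assumed class counts and ordering, for illustration only.
CLASS_COUNTS = [("nyud", 40), ("kitti", 6)]
OFFSETS, NUM_JOINT = build_offsets(CLASS_COUNTS)
```

Under these assumptions, KITTI class 5 becomes joint class 45, and the segmentation branch needs `NUM_JOINT` output channels.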

Sem. Segm. Depth Estimation Sem. Segm. Depth Estimation
mIoU,% RMSE (lin),m RMSE (log) mIoU,% RMSE (lin),m RMSE (log)
42.02 0.565 0.205 87.0 3.453 0.182
TABLE VII: Results on the test set of NYUDv2, KITTI (for depth) and KITTI-6 (for segmentation) of our single network predicting two modalities at once. Baseline results (with separate networks per dataset) are in bold

V-C Dense Semantic SLAM

Finally, we demonstrate that quantities predicted by our joint network performing depth estimation and semantic segmentation indoors can be directly incorporated into existing SLAM frameworks.

In particular, we consider SemanticFusion [6], where the SLAM reconstruction is carried out by ElasticFusion [49], which relies on RGB-D inputs in order to find dense correspondences between frames. A separate CNN, also operating on RGB-D inputs, was used by McCormac et al. [6] to acquire a 2D semantic segmentation map of the current frame. A dense 3D semantic map of the scene is obtained with the help of tracked poses predicted by the SLAM system.

We consider one sequence of the NYUD validation set provided by the authors, and directly replace ground truth depth measurements with the outputs of our network performing depth and segmentation jointly (Sect. IV-A). Likewise, we do not make use of the authors' segmentation CNN and instead exploit segmentation predictions from our network. Note also that our segmentation network was trained on 40 semantic classes, whereas here we directly re-map the results into the -classes domain [50]. We visualise dense surfel based reconstruction along with dense segmentation and current frame in Fig. 5. Please refer to supplementary video material for the full sequence results.
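Re-mapping predictions into a coarser label set amounts to a per-pixel table lookup, as in the sketch below. The lookup table shown in the usage note is a toy example; the actual class correspondence from [50] is not reproduced here, and the function name and `ignore_label` value are assumptions.

```python
def remap_labels(labels, lookup, ignore_label=255):
    """Map fine-grained labels to a coarser label set.

    labels: flattened list of predicted class indices.
    lookup: dict fine_label -> coarse_label.
    Labels missing from `lookup` are marked with `ignore_label`.
    """
    return [lookup.get(label, ignore_label) for label in labels]
```

For example, with the toy table `{0: 0, 1: 0, 2: 1}`, the prediction `[0, 1, 2, 7]` becomes `[0, 0, 1, 255]`.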

Fig. 5: 3D reconstruction output using our per-frame depths and segmentation inside SemanticFusion [6]. Panels: Point Cloud (ours), RGB Frame, Segm. Map (ours), Segm. Map [6]

VI Conclusion

We believe that the ease of extraction of visual information in robotic applications using deep learning models is crucial for further development and deployment of robots and autonomous vehicles. To this end, we presented a simple way of achieving real-time performance for the joint task of depth estimation and semantic segmentation. We showcased that it is possible (and indeed beneficial) to re-use large existing models in order to generate synthetic labels important for the pre-training stage of a compact model. Moreover, our method can be easily extended to handle more tasks and more datasets simultaneously, while raw depth and segmentation predictions of our networks can be seamlessly used within available dense SLAM systems. As our future work, we will consider whether it would be possible to directly incorporate expert’s uncertainty during the pre-training stage to acquire better results, as well as the case when there is no reliable expert available. Another interesting direction lies in incorporating findings of Zamir et al. [36] in order to reduce the total number of training annotations without sacrificing performance.