Understanding the semantic content of images is enough to solve many vision-based applications. However, augmenting semantic information with depth is essential for many other applications, such as on-board perception in autonomous driving and driver assistance. Due to cost and maintenance considerations, we wish to predict depth from the same single camera used to predict semantics, thus having a direct pixelwise correspondence without further calibration. Therefore, in this paper, we focus on monocular depth estimation (MDE) on-board vehicles, thus facing outdoor traffic environments. Recent advances in MDE rely on Convolutional Neural Networks (CNNs). Consider a CNN architecture for MDE that takes a single image as input and estimates its pixelwise depth map as output. Such a CNN can be trained in a supervised manner, i.e., finding the values of its weights by assuming access to a set of images with pixelwise depth ground truth (GT). Usually, such a GT is acquired at training time through a multi-modal suite of sensors, at least consisting of a camera calibrated with a LiDAR or some type of 3D laser scanner variant [Eigen:2014, Liu:2016, Roy:2016, Laina:2016, Cao:2017, Fu:2018DORN, Gurram:2018, He:2018, Xu:2018, Yin:2019]. Alternatively, we can use self-supervision based on either a calibrated stereo rig [Saxena:2007, Garg:2016, Godard:2017, Pillai:2019], or a monocular system and structure-from-motion (SfM) principles [Zhou:2017, Yin:2018GeoNet, Zhao:2020, Guizilini:20203D], or on a combination of both [Godard:2019MonoDepth2]. Combining stereo self-supervision and LiDAR supervision has also been analyzed [Kuznietsov:2017, He:2018wearable, Guizilini:2020]. In any case, the cheaper and simpler the suite of sensors used at training time, the better in terms of scalability and general access to the technology; however, the more challenging it is to train the CNN.
Currently, supervised methods tend to outperform self-supervised ones [De:2021]; thus, improving the latter is an open challenge worth pursuing.
This paper focuses on the most challenging setting, namely, when at training time we only have a single on-board camera allowing for SfM based self-supervision. Using only such a self-supervision may give rise to depth estimation inaccuracies due to camouflage (objects moving as the camera may not be distinguished from background), visibility changes (occlusion changes, non-Lambertian surfaces), static-camera cases (i.e., stopped ego-vehicle), and textureless areas, as well as to scale ambiguity (depth could only be estimated up to an unknown scale factor). In fact, an interesting approach to compensate for these problems could be leveraging virtual-world images (RGB) with accurate pixelwise depth (D) supervision. Using virtual worlds [Gaidon:2016, Cabon:2020, Ros:2016, Mayer:2016, Richter:2017, Shah:2017, Dosovitskiy:2017], we can acquire as many RGBD virtual-world samples as needed. However, these virtual-world samples can only be useful provided we address the virtual-to-real domain gap [Zheng:2018T2Net, Kundu:2018AdaDepth, Zhao:2019GASDA, Pnvr:2020SharinGAN, Cheng:2020S3Net], which links MDE with visual domain adaptation (DA), a realm of research in itself [Csurka:2017, Wang-Deng:2018, Wilson:2020].
We propose to perform monocular depth estimation by virtual-world supervision (MonoDEVS) and real-world SfM self-supervision, estimating depth in absolute scale. By relying on standard benchmarks, we show that our MonoDEVSNet outperforms previous ones trained on monocular and even stereo sequences. We think our released code and models (https://github.com/HMRC-AEL/MonoDEVSNet) will help researchers and practitioners to address applications requiring on-board depth estimation, also establishing a strong baseline to be challenged in the future.
In the following, Section II summarizes previous works related to ours. Section III details our proposal. Section IV describes the experimental setting and discusses the obtained results. Finally, Section V summarizes the presented work and its conclusions, and outlines the work we target for the near future.
II Related Work
MDE was first addressed by combining hand-crafted features and shallow machine learning [Saxena:2007, Liu:2010, Ladicky:2014, Srikakulapu:2015]. However, nowadays, the best-performing models are based on CNNs [De:2021]. Therefore, we review the CNN-based approaches to MDE which are most related to ours.
II-A Supervised MDE
Relying on depth GT, Eigen et al. [Eigen:2014] developed an architecture for coarse-to-fine depth estimation with a scale-invariant loss function. This pioneering work inspired new CNN-based architectures for MDE [Liu:2016, Laina:2016, Roy:2016, Cao:2017, He:2018, Xu:2018, Fu:2018DORN], which also assume depth GT supervision. MDE has also been tackled as a task within a multi-task learning framework, typically together with semantic segmentation, since both tasks aim at producing pixelwise information and may eventually help each other to improve their predictions at object boundaries. For instance, this is the case of some MDE models for indoor scenarios [Mousavian:2016, Jafari:2017, Jiao:2018]. These proposals assume that pixelwise depth and class GT are simultaneously available at training time. However, this is expensive, and such GT is scarcely available for outdoor scenarios. In order to address this problem, Gurram et al. [Gurram:2018] proposed a training framework which does not require depth and class GT to be available for the same images. Guizilini et al. [Guizilini:2020semantic] used an out-of-the-box CNN for semantic segmentation to train semantically-guided depth features while training the depth estimation CNN.
The drawback of these supervised approaches is that the depth GT usually comes from expensive LiDARs, which must be calibrated and synchronized with the cameras, even if the objective is to use only cameras for the functionality under development. Moreover, LiDAR depth is sparse compared to available image resolutions. Besides, surfaces like vehicle glass or dark vehicles may be problematic for LiDAR sensing. Consequently, depth self-supervision and alternative sources of supervision are receiving increasing interest.
II-B Self-supervised MDE
Using a calibrated stereo rig to provide self-supervision for MDE is a much cheaper alternative to camera-LiDAR suites. Garg et al. [Garg:2016] pioneered this approach by training with a warping loss involving pairs of stereo images. Godard et al. [Godard:2017] introduced epipolar geometry constraints with additional terms for smoothing and enforcing consistency between left-right image pairs. Chen et al. [Chen:2019] improved MDE results by enforcing semantic consistency between stereo pairs, via a joint training of MDE and semantic segmentation. Pillai et al. [Pillai:2019] implemented sub-pixel convolutional layers for depth super-resolution, as well as a novel differentiable layer to improve depth prediction on image boundaries, a known limitation of stereo self-supervision. Other authors [Kuznietsov:2017, He:2018wearable] still complement stereo self-supervision with sparse LiDAR supervision.
SfM principles [Ozyesil:2017] can also be followed to provide self-supervision for MDE. In fact, in this setting we can assume a monocular on-board system at training time. Briefly, the underlying idea is that obtaining a frame from its temporally adjacent ones can be decomposed into jointly estimating the scene depth for that frame and the camera pose at its acquisition time relative to the pose at the acquisition times of the adjacent frames; i.e., the camera ego-motion. Thus, we can train a CNN to estimate (synthesize) the frame from its adjacent ones, where, basically, the photo-metric error between the actual frame and the synthesized view output by this CNN acts as training loss. After the training process, part of the CNN can perform MDE up to a scale factor (relative depth).
Zhou et al. [Zhou:2017] followed this idea, adding an explainability mask to compensate for violations of SfM assumptions (due to frame-to-frame changes in the visibility of the frame’s content, textureless surfaces, etc.). This mask is estimated by a CNN jointly trained with the depth estimation CNN to output a pixelwise belief on the synthesized views. Later, Yin et al. [Yin:2018GeoNet] proposed GeoNet, which aims at improving MDE by also predicting optical flow to explicitly consider the motion introduced by dynamic objects (e.g., vehicles, pedestrians), i.e., a motion that violates SfM assumptions. However, this was effective in predicting occlusions, but not in significantly improving MDE accuracy. Godard et al. [Godard:2019MonoDepth2] followed the idea of having a mask to indicate stationary pixels, which should not be taken into account by the loss driving the training. Such pixels typically appear on vehicles moving at the same speed as the camera, or can even correspond to full frames when the ego-vehicle stops and, thus, the camera becomes stationary for a while. Pixels of similar appearance in consecutive frames are considered stationary. This simple definition can work because, instead of using a training loss based on absolute photo-metric errors (i.e., on minimizing pairwise pixel differences), the structural similarity index measure (SSIM) [Wang:2004] is used. Moreover, within the so-called MonoDepth2 framework, Godard et al. [Godard:2019MonoDepth2] combine SfM and stereo self-supervision to establish state-of-the-art results. Alternatively, Guizilini et al. [Guizilini:2020semantic] addressed the presence of dynamic objects by a two-stage MDE training process. The first stage ignores the presence of such objects, returning a depth estimation model trained with a loss based on SSIM. Then, before running the second stage, the training sequences are processed to filter out frames that may contain erroneous depth estimations due to moving objects. Such frames are identified by applying a RANSAC algorithm to estimate the ground plane from their estimated depth, and determining whether a significant number of pixels would be projected far below the ground plane. Finally, in the second stage, the model is retrained from scratch without the filtered frames.
Zhao et al. [Zhao:2020] focused on avoiding scale inconsistencies among different frames as produced under SfM self-supervision, especially when they come from sequences whose underlying depth ranges are too different. In this case, both depth and optical flow estimation CNNs are trained, but not a pose estimation one. Instead, the optical flow between two frames is used to find robust pixel correspondences between them, which are used to compute their relative camera pose, computing the fundamental matrix by the 8-point algorithm, and then performing triangulation between the corresponding pixels of these frames. Overall, a sparse depth pseudo-GT is estimated and used as supervision to train the depth estimation CNN. However, even after making scale more consistent among frames, this method still outputs relative rather than absolute depth. To avoid this problem, Guizilini et al. [Guizilini:2020] used sparse LiDAR supervision with SfM self-supervision, relying on depth and pose estimation networks. More recently, Guizilini et al. [Guizilini:20203D] relied on camera velocity (i.e., the available vehicle velocity) to solve scale ambiguity in a pure SfM self-supervision setting. In particular, a velocity supervision loss trains the pose estimation CNN to learn scale-aware camera translation which, in turn, enables scale-aware depth estimation.
Overall, this literature shows the relevance of achieving MDE via SfM self-supervision and strategies to account for violation of SfM assumptions, as well as to obtain absolute depth values. Among these strategies, complementing SfM self-supervision with supervision (depth GT) coming from additional sensors such as a LiDAR and/or a stereo rig seems to be the most robust approach to address all the problems at once. However, then, a single camera would not be enough at training time. In this paper, we also complement SfM self-supervision with accurate depth supervision. However, instead of relying on additional sensors, we use virtual-world data.
II-C Virtual-world data for MDE
Training an MDE model on virtual-world images to later perform on real-world ones requires addressing the virtual-to-real domain gap. Many approaches perform a virtual-to-real image-to-image translation coupled to the training of the MDE model. This translation usually relies on generative adversarial networks (GANs) [Goodfellow:2014, Choi:2020], since training them only requires unpaired and unlabeled sets of real- and virtual-world images.
Zheng et al. [Zheng:2018T2Net] proposed T2Net. In this case, a GAN and the MDE model are jointly trained, where the GAN aims at performing virtual-to-real translation while acting as an auto-encoder for real-world images. The translated images are the input for the MDE model since they have depth supervision. Additionally, a GAN operating on the encoder weights (features) of the MDE model was incorporated during training to force similar depth feature distributions between translated and real-world images. However, this feature-level GAN worsened MDE results in outdoor scenarios. Kundu et al. [Kundu:2018AdaDepth] proposed AdaDepth, which trains a common feature space for real- and virtual-world images, i.e., a space where it is not possible to distinguish the domain of the input images. Then, depth estimation is trained from this feature space. To achieve this, adversarial losses are used at the feature space level as well as at the estimated depth level.
Cheng et al. [Cheng:2020S3Net] proposed S3Net, which extends T2Net with SfM self-supervision. In this case, GAN training involves semantic and photo-metric consistency. Semantic consistency between the virtual-world images and their GAN-translated counterparts is required, which is measured via semantic segmentation (which also involves jointly training a CNN for this task). Photo-metric consistency is required for consecutive GAN-translated images, which is measured via optical flow. Note that semantic segmentation and optical flow GT are available for virtual-world images. The MDE model uses the GAN-translated images as input and is trained end-to-end with the GAN. Then, a further fine-tuning step of the MDE model is performed using only the real-world sequence and SfM self-supervision, i.e., involving the training of a pose estimation CNN while fine-tuning. During this process, a masking mechanism inspired by [Godard:2019MonoDepth2] is also used to compensate for SfM-adverse scenarios. Contrary to AdaDepth and T2Net, S3Net just outputs relative depth.
Zhao et al. [Zhao:2019GASDA] proposed GASDA, which leverages real-world stereo and virtual-world data. In this case, the CycleGAN idea [Zhu:2017] is used to perform DA, which actually involves two GANs, one for virtual-to-real image translation and another for real-to-virtual. Two MDE models are trained coupled to CycleGAN, one intended to process images with real-world appearance (actual real-world images or GAN-translated from the virtual domain), the other to process images with synthetic appearance (actual virtual-world images or GAN-translated from the real domain). In fact, at testing time, the most accurate depth results are obtained by averaging the output of these two models, which also involves translating the real-world images to the virtual domain by the corresponding GAN. Thanks to the stereo data, left-right depth and geometry consistency losses are also included during training, aiming at obtaining a more accurate model. PNVR et al. [Pnvr:2020SharinGAN] proposed SharinGAN for training a DA GAN coupled to a specific task. One of the selected tasks is MDE with stereo self-supervision, as in [Zhao:2019GASDA]. In this case, real- and virtual-world images are transformed to a new image domain where their appearance discrepancies are minimized to perform MDE from them, i.e., the GAN and the MDE model are jointly trained end-to-end. SharinGAN outperformed GASDA. However, at testing time, before performing the MDE, the real-world images must be translated by the GAN to the new image domain. Both GASDA and SharinGAN produce absolute scale depth.
II-D Relationship of MonoDEVSNet with previous literature
In terms of operational training conditions, the most similar paper to ours is S3Net [Cheng:2020S3Net]. However, contrary to S3Net, our MonoDEVSNet can estimate depth in absolute scale. On the other hand, methods based on pure SfM self-supervision such as [Zhou:2017], [Yin:2018GeoNet], [Godard:2019MonoDepth2] (only SfM setting), and [Guizilini:2020semantic], just report relative depth. In order to compare MonoDEVSNet with them, we have estimated relative depth too. We will see how we outperform these methods, proving the usefulness of leveraging depth supervision from virtual worlds. In fact, regarding relative depth, we also outperform S3Net. Methods leveraging virtual-world data such as GASDA [Zhao:2019GASDA] and SharinGAN [Pnvr:2020SharinGAN] rely on real-world stereo data at training time, while we only require monocular sequences. On the other hand, our training framework can be extended to accommodate stereo data if available, although it is not our current focus. S3Net, GASDA, SharinGAN, T2Net [Zheng:2018T2Net], and AdaDepth [Kundu:2018AdaDepth] leverage ideas from GAN-based DA to reduce the virtual-to-real domain gap, either in image space (S3Net, GASDA, SharinGAN, T2Net) or in feature space (AdaDepth). We have analyzed both image- and feature-based DA, finding that the latter outperforms the former, in particular by using the Gradient-Reversal-Layer (GRL) DA strategy [Ganin:2015, Ganin:2016] which, to the best of our knowledge, has not been previously applied to MDE. Currently, we outperform the SfM self-supervision framework in [Guizilini:20203D] thanks to the virtual-world supervision and our GRL DA strategy. However, using vehicle velocity to obtain absolute depth as in [Guizilini:20203D] is a complementary strategy that could also be incorporated in our framework, although it is not the focus of this paper.
In this section, we introduce MonoDEVSNet, which aims at leveraging virtual-world supervision to improve real-world SfM self-supervision. Since we train from both real- and virtual-world data jointly, we describe our supervision and self-supervision losses, the loss for addressing the virtual-to-real domain gap, and the strategy to obtain depth in absolute scale. Our proposal is visually summarized in Fig. 1.
III-A Training data
For training MonoDEVSNet, we assume two sources of data. On the one hand, we have image sequences acquired by a monocular system on-board a vehicle while driving in real-world traffic. We denote as one of such frames acquired at time . We denote these data as , where is the number of frames from the real-world sequences. These frames do not have associated GT. On the other hand, we have analogous sequences but acquired on a virtual world, i.e., on-board a vehicle immersed in a traffic simulation. We denote as one of such virtual-world frames acquired at time . We refer to these data as , where is the number of frames from the virtual-world sequences. The images in do have associated GT, since it can be automatically generated. In particular, as it is commonly available in today’s simulators, we assume pixelwise depth and semantic class GT. We define to be this GT; i.e., given , is its depth GT, and its semantic class GT.
III-B MonoDEVSNet architecture:
MonoDEVSNet is composed of three main blocks: an encoding block, a multi-scale pyramidal block, and a decoding block inspired by [Godard:2019MonoDepth2]. Therefore, the total set of weights is the union of the weights of these three blocks. Here, the encoding block acts as a feature backbone. Moreover, since we aim at evaluating several encoders, the role of the multi-scale pyramid block is to adapt the bottleneck of the chosen encoder to the decoder. At testing time, MonoDEVSNet will process any real-world image acquired on-board the ego-vehicle, while at training time it processes either real- or virtual-world frames.
III-C Problem formulation
Training consists in finding the optimum weight values, , by solving the problem:
where is a loss function, and indicates the use of the virtual-world frames with their GT. As we are going to detail, relies on three different losses, namely, and . The loss focuses on training based on SfM self-supervision, thus, only relying on real-world data sequences. The SfM self-supervision is achieved with the support of a camera pose estimation task performed by a CNN, , of weights . Thus, we have . The loss focuses on training with virtual-world supervision, in particular, using depth and semantic GT from virtual-world sequences. Therefore, we have . Finally, focuses on creating domain-invariant features as part of
. In particular, we rely on a binary real/virtual domain-classifier CNN,, of weights . Thus, we have .
III-D SfM Self-supervised loss:
Since we focus on improving MDE by the additional use of virtual-world data, for the SfM self-supervision we leverage the state-of-the-art proposal in [Godard:2019MonoDepth2], which we briefly summarize here for the sake of completeness as:
As previously introduced in [Godard:2017], the term is a constant weighted loss to force local smoothness on , taking into account the edges of . The term is the actual SfM-inspired loss. It involves the joint training of the depth estimation weights, , and the relative camera pose estimation weights, . Figure 1 illustrates the CNN, , associated to these weights, which takes as input two consecutive frames, e.g., , and outputs the pose transform (rotation and translation), , between them. Then, as can be seen in Fig. 1, a projection module takes and the depth estimation , to generate the synthesized frame which, ideally, should match . In fact, both frames adjacent to are considered for robustness. Accordingly, the SfM-inspired component of can be defined as:
where is a pixelwise conditioned photo-metric error and its average over the pixels. Obtaining starts by computing two pixelwise photo-metric error measurements, and , where , and is the pixelwise photo-metric error between and proposed in [Godard:2017], i.e., based on local structural similarity (SSIM) and pixelwise photo-metric absolute differences between and . Thus, applies pixelwise. Then, a pixelwise binary mask, called auto-mask in [Godard:2019MonoDepth2], is computed as:
where denotes the Iverson bracket applied pixelwise. Finally, is computed as:
where stands for pixelwise multiplication. The auto-mask conditions which pixels of are considered during the gradient computation of , i.e., is computed in the forward pass of CNN training, but it is considered as a constant during back-propagation. As explained in [Godard:2019MonoDepth2], the aim of is to remove, during training, the influence of pixels which remain the same between adjacent frames because they are assumed to often indicate SfM violations such as a static camera, objects moving as the camera, or low texture regions. Finally, we remark that the support of is needed at training time, but not at testing time.
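The auto-masking step described above can be illustrated with a minimal sketch. It assumes the MonoDepth2-style formulation, in which the per-pixel minimum error over the synthesized (warped) adjacent frames is compared with the per-pixel minimum error over the unwarped adjacent frames; the function name `auto_masked_error` and the toy error values are hypothetical, and flat Python lists stand in for image tensors.

```python
def auto_masked_error(err_warped, err_identity):
    """Sketch of an auto-masked photo-metric error (assumed formulation).

    err_warped:   one list of per-pixel errors per adjacent frame, measured
                  between the target frame and each synthesized view.
    err_identity: same layout, but measured between the target frame and
                  each unwarped adjacent frame.
    """
    n_pixels = len(err_warped[0])
    # Per-pixel minimum over the adjacent frames.
    min_warped = [min(e[i] for e in err_warped) for i in range(n_pixels)]
    min_identity = [min(e[i] for e in err_identity) for i in range(n_pixels)]
    # Iverson bracket: keep a pixel only if warping explains it better than
    # simply copying the adjacent frame (i.e., the pixel is not stationary).
    mask = [1.0 if w < s else 0.0 for w, s in zip(min_warped, min_identity)]
    # The mask is treated as a constant: it gates the error, nothing more.
    masked = [m * w for m, w in zip(mask, min_warped)]
    loss = sum(masked) / n_pixels  # average over the pixels
    return mask, masked, loss
```

For instance, with two adjacent frames and three pixels, `auto_masked_error([[0.2, 0.5, 0.1], [0.3, 0.4, 0.6]], [[0.4, 0.1, 0.05], [0.5, 0.3, 0.2]])` keeps only the first pixel, since the other two are better explained by the unwarped frames.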
III-E Supervised loss:
In this case, since we address an estimation problem and we have accurate GT, we base on the L1 metric. On the other hand, MDE is especially interesting to determine how far the ego-vehicle is from vehicles, pedestrians, etc. Accordingly, since includes semantic class GT, we use it to increase the relevance of accurately estimating the depth for such major traffic protagonists. Moreover, since virtual-world depth maps are based on the Z-buffer involved in image rendering, the range of depth values available as GT tends to be over-optimistic even for active sensors such as LiDAR. For instance, there can be depth values larger than in the Z-buffer. Since we do not aim at estimating depth beyond a reasonable threshold (in ), , to compute we will also discard pixels with . For each , both the semantic class relevance and the out-of-range depth values can be codified as real-valued weights running on and arranged on a mask, . Thus, depends on and . However, contrary to , we can compute offline, i.e., before starting the training process. Taking all these details into account, we define our supervised loss as:
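A minimal sketch of such a masked L1 loss follows. The normalization by the number of pixels, the weight values (2.0 for relevant classes, 1.0 otherwise, 0.0 for out-of-range depth), the class names, and the function names `build_mask` and `masked_l1` are all assumptions made for illustration, not the definition used in this paper.

```python
def build_mask(depth_gt, class_gt, max_depth, relevant_classes, boost=2.0):
    """Offline weight mask: 0 for out-of-range pixels, `boost` (hypothetical
    value) for pixels of relevant traffic classes, and 1 elsewhere."""
    mask = []
    for depth, label in zip(depth_gt, class_gt):
        if depth > max_depth:
            mask.append(0.0)       # discard depth beyond the threshold
        elif label in relevant_classes:
            mask.append(boost)     # emphasize vehicles, pedestrians, etc.
        else:
            mask.append(1.0)
    return mask


def masked_l1(depth_pred, depth_gt, mask):
    """Weighted L1 error, averaged over all pixels (assumed normalization)."""
    total = sum(m * abs(p - g) for m, p, g in zip(mask, depth_pred, depth_gt))
    return total / len(mask)
```

Because the mask depends only on the virtual-world GT, it can be built once per frame before training and reused at every epoch.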
III-F Domain adaptation loss:
As can be seen in Fig. 1, we aim at learning depth features, , so that it cannot be distinguished whether they were generated from a real-world input frame (target domain) or a virtual-world one (source domain); in other words, learning a domain invariant . Taking into account that we do not have accurate depth GT in the target domain, while we do have it for the source domain, we need to apply an unsupervised DA technique to train . In addition, as part of , the training of must result on an accurate . Achieving this accuracy and domain invariance are adversarial goals. Accordingly, we propose to use the Gradient-Reversal-Layer (GRL) idea introduced in [Ganin:2015], which, up to the best of our knowledge, has not been applied before for DA in the context of MDE. In this approach, the domain invariance of is measured by a binary target/source domain-classifier CNN, , of weights . In [Ganin:2015], a logistic loss is proposed to train the domain classifier. In our case, this is set as:
where we assume that outputs 1 if and 0 if . The GRL has no parameters and connects with (see Fig. 1). Its behavior is exactly as explained in [Ganin:2015]. This means that during forward passes of training, it acts as an identity function, while, during back-propagation, it reverses the gradient vector passing through it. Both the GRL and the domain classifier are required at training time, but not at testing time.
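The GRL behavior can be sketched without any deep learning framework, as below. This is only an illustration: in practice the layer would be written as a custom autograd operation of the chosen framework, and the class name `GradientReversal` and its `scale` argument (a fixed hyper-parameter in the spirit of [Ganin:2015], not a learnable weight) are hypothetical.

```python
class GradientReversal:
    """Identity in the forward pass, negated (and scaled) gradient backwards."""

    def __init__(self, scale=1.0):
        # Fixed trade-off factor; it is a hyper-parameter, not a trained weight.
        self.scale = scale

    def forward(self, features):
        # Features reach the domain classifier unchanged.
        return list(features)

    def backward(self, grad_output):
        # The gradient coming from the domain classifier is reversed, so the
        # feature extractor is pushed to confuse the classifier.
        return [-self.scale * g for g in grad_output]
```

With `scale=0.5`, a gradient `[0.4, -0.2]` arriving from the domain classifier is passed on to the feature extractor as `[-0.2, 0.1]`, while the forward features are left untouched.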
III-G Overall training procedure
Algorithm 1 summarizes the steps to compute the needed gradient vectors for mini-batch optimization. In particular, we need the gradients related to MonoDEVSNet weights, , and the weights of the auxiliary tasks, i.e., for SfM self-supervision, and for DA. Regarding gradient computation, we do not need to distinguish from , so we define . In Alg. 1, we introduce an equalizing factor between supervised and self-supervised losses, , which aims at avoiding one loss dominating over the other along the training. A priori, we could set a constant factor. However, in practice, we have found that having an adaptive value is more useful. Therefore, inspired by the GradNorm idea [Chen:2018GradNorm], we use the ratio between the supervised and self-supervised losses. Algorithm 1 also introduces the scaling factor which, following [Ganin:2015], controls the trade-off between optimizing to obtain an accurate model versus being domain invariant. Finally, and indicate whether this loss must be computed only using virtual- or real-world data, respectively.
III-H Absolute depth computation
The virtual-world supervised data trains on absolute depth values, while the real-world SfM self-supervised data trains on relative depth values. Thanks to the unsupervised DA, the depth features are trained to be domain invariant. However, according to our experiments, this is not sufficient for producing accurate absolute depth values at testing time. Fortunately, thanks to the use of virtual-world data, we can still compute a global scaling factor, , so that is accurate in absolute depth terms. For that, we assume that the sequences in are acquired with a camera analogous to the one used to acquire the sequences in . Here analogous refers to using the same number of pixels, field of view, frame rate, and mounted on-board in similar heading directions. Note that simulators are flexible enough for setting these camera parameters as needed. Accordingly, we train a model using only data from and SfM self-supervision, i.e. as if we would not have supervision for . Then, we find the median depth value produced by this model on the virtual-world data, . Finally, we set , where is the median depth value of the GT. This pre-processing step is performed once and the model discarded afterwards. Other works follow a similar idea [Zhou:2017, Godard:2019MonoDepth2, Cheng:2020S3Net, Guizilini:2020semantic, Zhao:2020] to compute absolute depth but relying on LiDAR data as GT reference, while we only rely on virtual-world data.
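The calibration of the global scaling factor can be sketched as follows, assuming that the factor is the ratio between the median GT depth and the median depth predicted by the SfM-only model on the virtual-world data. The function names and the toy depth values are hypothetical.

```python
from statistics import median


def global_scale(pred_depths_virtual, gt_depths_virtual):
    """One constant factor, computed once from virtual-world data only."""
    return median(gt_depths_virtual) / median(pred_depths_virtual)


def to_absolute(relative_depths, scale):
    """Apply the same constant factor to any depth map at testing time."""
    return [scale * d for d in relative_depths]
```

For example, if the SfM-only model predicts `[0.5, 1.0, 2.0]` where the GT is `[5.0, 10.0, 20.0]`, the factor is 10.0, and it is then reused unchanged for every test image.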
Algorithm 1 (outline): (1) forward passes with virtual-world data; (2) back-propagation for supervision and DA; (3) forward passes with real-world data; (4) back-propagation for self-supervision and DA; (5) setting the final gradient vectors.
IV Experimental Results
In this section, we start by defining the datasets and evaluation metrics used in our experiments. Afterwards, we provide relevant implementation and training details of MonoDEVSNet. Finally, we present and discuss our quantitative and qualitative results, comparing them with those from previous literature as well as performing an ablative analysis focused on the main components of MonoDEVSNet.
IV-A Datasets and evaluation metrics
We use publicly available datasets and metrics which are de facto standards in MDE research. In particular, we use KITTI Raw (KR) [Geiger:2013] and Virtual KITTI (VK) [Cabon:2020] as real- and virtual-world sequences, respectively. We follow the training-testing split of Zhou et al. [Zhou:2017]. From the training split we select 12K monocular triplets, i.e., samples of the form . The testing split consists of 697 isolated images with LiDAR-based GT, actually introduced by Eigen et al. [Eigen:2014]. In addition, for considering the semantic content of the images in the analysis of results, we also use KITTI Stereo 2015 (KS) [Menze:2015] for testing. This dataset consists of 200 isolated images with enhanced depth maps and semantic labels. VK is used only for training; here, we also use 12K monocular triplets (non-rotated camera subset) with associated depth GT. In this case, the triplets are used to calibrate the global scaling factor (see Sect. III-H), while for actual training supervision only 12K isolated frames are used. As the depth GT of VK ranges up to m, to match the range of KR’s LiDAR-based GT, we clip it to m (). VK includes weather conditions similar to those of KR/KS, and adds situations with fog, overcast, and rain, as well as sunrise and sunset illumination.
Finally, as is common since [Godard:2017], we use the Make3D dataset [Saxena:2009] for assessing generalization, since it is based on photographs of urban and natural areas. Therefore, Make3D shows views and content quite different from those acquired on-board a vehicle, as in KR, KS, and VK. The images come with depth GT acquired by a 3D scanner. There are 534 images with depth GT, organized in a standard split of 400 for training and 134 for testing. We use the latter, since we rely on Make3D only for testing our proposal.
In order to assess quantitative MDE results, we use the standard metrics introduced by Eigen et al. [Eigen:2014], i.e., the average absolute relative error (abs-rel), the average squared relative error (sq-rel), the root mean square error (rms), and the rms log error (rms-log). For these metrics, the lower the better. In addition, the accuracy (termed as ) under a threshold is also used as metric. In this case, the higher the better. The abs-rel error and the are percentage measurements, sq-rel and rms are reported in meters, and rms-log is similar (reported in meters) to rms but applied to logarithm depth values.
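These metrics can be written compactly as in the sketch below, which follows their usual definitions over valid pixels with strictly positive depth. The function name `depth_metrics` is hypothetical, and the default threshold of 1.25 is the value commonly used in the literature, assumed here rather than taken from this text.

```python
import math


def depth_metrics(pred, gt, threshold=1.25):
    """Standard MDE error and accuracy measures over valid pixels (depth > 0)."""
    n = len(gt)
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / n
    rms = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    rms_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in zip(pred, gt)) / n
    )
    # Fraction of pixels whose prediction/GT ratio (in either direction)
    # stays below the threshold.
    accuracy = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold) / n
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rms": rms,
            "rms_log": rms_log, "accuracy": accuracy}
```

Higher powers of the threshold (e.g., `threshold=1.25 ** 2`) yield the looser accuracy levels that are usually reported alongside the strictest one.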
These metrics are applied to absolute depth values for MDE models trained with depth supervision coming from either LiDAR [Eigen:2014, Liu:2016, Roy:2016, Laina:2016, Cao:2017, Fu:2018DORN, Gurram:2018, He:2018, Xu:2018, Yin:2019, Guizilini:2020], stereo [Saxena:2007, Garg:2016, Godard:2017, Godard:2019MonoDepth2, Pillai:2019], real-world stereo and virtual-world depth [Zhao:2019GASDA, Pnvr:2020SharinGAN], or stereo and LiDAR [Kuznietsov:2017, He:2018wearable]. However, MDE models trained on pure SfM self-supervision can only estimate depth in relative terms, i.e., up to scale. Moreover, the scale factor varies from image to image, a problem known as scale inconsistency. In this case, before computing the above metrics, a per-image correction factor computed at testing time is applied [Zhou:2017, Yin:2018GeoNet, Zhao:2020, Godard:2019MonoDepth2, Guizilini:2020semantic, Cheng:2020S3Net]. In particular, given a test image with GT and estimated depth and , respectively, the common practice consists of computing a scale as the ratio , and then comparing with . On the other hand, SfM self-supervision with the help of additional information can train models able to produce absolute scale at testing time. For instance, [Guizilini:20203D] uses the ego-vehicle speed and, in fact, virtual-world supervision can help too [Zheng:2018T2Net, Kundu:2018AdaDepth]. The latter approach is the one followed in this paper, especially thanks to the procedure presented in Sect. III-H. Therefore, will be evaluated in relative scale terms, and in absolute terms. Please note that our scaling factor is constant for all the evaluated images and computed at training time. In the following, when presenting quantitative results, we will make clear whether they are in relative or absolute terms.
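The per-image correction used for relative-depth evaluation can be sketched as follows, assuming the usual median-ratio form of [Zhou:2017] (median GT depth over median estimated depth, computed on the valid pixels of each test image). The function names are hypothetical.

```python
from statistics import median


def per_image_scale(pred, gt):
    """Test-time correction factor for one image (requires its depth GT)."""
    return median(gt) / median(pred)


def rescale_for_evaluation(pred, gt):
    """Rescale one relative-depth prediction before computing the metrics."""
    s = per_image_scale(pred, gt)
    return [s * p for p in pred]
```

Unlike this per-image factor, which changes from image to image and needs GT at testing time, a global factor is computed once at training time and applied as a constant.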
IV-B Implementation details
We start by selecting the actual CNN layers to implement . Since we leverage the SfM self-supervision idea from [Godard:2019MonoDepth2], a straightforward implementation would be to use its ResNet-based architecture as is. However, the High-Resolution Network (HRNet) architecture [Wang:2020HrNet] exhibits better accuracy in visual tasks such as semantic segmentation and object detection, suggesting that it can be a better backbone than ResNet. Thus, we decided to start our experiments by comparing ResNet and HRNet backbones using the SfM self-supervision framework provided in [Godard:2019MonoDepth2]. In particular, we assess different ResNet/HRNet architectures for , while using the proposal in [Godard:2019MonoDepth2] for . Then, when using ResNet we have , while for HRNet consists of pyramidal layers adapting the and CNN architectures under test. For these experiments, we rely on KR. Table I shows the accuracy (in relative scale terms) of the tested variants and their number of weights. We see that HRNet outperforms ResNet, with HRNet-W48 being the best. Thus, for the following experiments, we rely on HRNet-W48 despite it being the heaviest. We show the corresponding pyramidal architecture of in Fig. 2. It is composed of five blocks [Paszke:2019pytorch].
In order to train the camera pose estimation network, , we follow [Godard:2019MonoDepth2] but using ResNet-50 instead of ResNet-18 since the former is more accurate. Four convolutional layers are used to convert the ResNet-50 bottleneck features to the 6-DoF relative pose vector (3D translation and rotation). For training the classification block of , i.e., , we use a standard classification pipeline based on convolutions, ReLU and fully connected layers. Finally, we remark that these networks are not required at testing time.
IV-C Training details
The input images are processed (at training and testing time) at a resolution of ( ), where LANCZOS interpolation is performed from the original resolution. As optimizer, we use ADAM with learning rate set as , and the rest of its hyper-parameters remain with default values. The weights are initialized from available ImageNet pre-training, and are randomly initialized with Kaiming weights, while the ResNet-50 part of is also initialized with ImageNet and the rest (convolutional layers to output the pose vector) following Kaiming. The mini-batch size is 16 images, 50%/50% from the real/virtual domains. To minimize over-fitting, we apply standard data augmentation such as horizontal flip, and a 50% chance of random brightness, contrast, saturation, and hue jitter with ranges of , , , and , respectively. Remaining hyper-parameters were set as in Eq. (2), in Alg. 1, and in Eq. (6) our mask is set to have values of for traffic participants (vehicles, pedestrians, etc.), for static infrastructure (buildings, road, vegetation, etc.), and for the sky and pixels with depth over (here m).
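The 50%/50% mixed mini-batch composition can be illustrated with a small sampler. This is only a sketch under assumptions: the function name `mixed_batches`, the shuffling scheme, and the choice to stop when the smaller domain is exhausted are illustrative and are not taken from the training code of the paper.

```python
import random


def mixed_batches(real_items, virtual_items, batch_size=16, seed=0):
    """Yield mini-batches with half of the samples from each domain.

    Each element is a (domain, item) pair so that a training loop can route
    real samples to the self-supervised loss and virtual samples to the
    supervised loss within the same optimization step.
    """
    if batch_size % 2 != 0:
        raise ValueError("batch_size must be even for a 50%/50% split")
    half = batch_size // 2
    rng = random.Random(seed)
    real, virtual = list(real_items), list(virtual_items)
    rng.shuffle(real)
    rng.shuffle(virtual)
    # Assumed policy: an epoch ends when the smaller domain runs out.
    n_batches = min(len(real), len(virtual)) // half
    for b in range(n_batches):
        lo, hi = b * half, (b + 1) * half
        yield ([("real", x) for x in real[lo:hi]]
               + [("virtual", x) for x in virtual[lo:hi]])
```

With `batch_size=16`, every batch carries 8 real-world and 8 virtual-world samples, in contrast with alternating batches that contain a single domain each.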
IV-D Results and discussion
IV-D1 Relative depth assessment
We start by assessing MDE in relative terms. Table II presents MonoDEVSNet results (Ours) and those from previous works based on SfM self-supervision. From this table we can draw several observations. Regarding DA, MonoDEVSNet (VK_v1) outperforms S3Net (VK_v1) in all metrics. The new version of VK (VK_v2) allows us to obtain even better results. MonoDEVSNet with virtual-world supervision outperforms the version with only SfM self-supervision (best result in Table I) in all metrics, no matter the VK version we use. Overall, MonoDEVSNet outperforms most previous methods, being on par with [Guizilini:2020semantic].
|Method||abs-rel||sq-rel||rms||rms-log||δ<1.25||δ<1.25²||δ<1.25³|
|[Zhou:2017] (Zhou et al.)||0.183||1.595||6.709||0.270||0.734||0.902||0.959|
|[Zhao:2020] (Zhao et al.)||0.113||0.704||4.581||0.184||0.871||0.961||0.984|
|[Guizilini:2020semantic] (Guizilini et al.)||0.102||0.698||4.381||0.178||0.896||0.964||0.984|
|[Cheng:2020S3Net] S3Net (VK_v1)||0.124||0.826||4.981||0.200||0.846||0.955||0.982|
IV-D2 Absolute depth assessment
While assessing depth in relative terms is a reasonable option to compare methods purely based on SfM self-supervision, the most relevant evaluation is in terms of absolute depth. These results are presented in Table III. The first (top) block of this table shows results based on depth supervision from LiDAR; thus, a priori, they can be thought of as upper-bounds for methods based on self-supervision. The second block shows methods that only use virtual-world supervision. The third and fourth (bottom) blocks show results based on stereo and SfM self-supervision, respectively. Methods in gray use DA supported by VK. We can draw several observations from this table. MonoDEVSNet (Ours) is the best performing among those leveraging supervision from VK_v1 and, consistently with the results on relative depth, by using VK_v2 we improve MonoDEVSNet results. In fact, MonoDEVSNet based on VK_v2 outperforms all self-supervised methods, including those using stereo rigs instead of monocular systems. We are not yet able to reach the performance of the best methods supervised with LiDAR data. However, it is clear that our proposal is able to successfully combine real-world SfM self-supervision and virtual-world supervision. Thus, we think it is worth pursuing this line of research until reaching the LiDAR-based upper-bounds.
|Method||abs-rel||sq-rel||rms||rms-log||δ<1.25||δ<1.25²||δ<1.25³|
|[Eigen:2014] (Eigen et al.)||0.203||1.548||6.307||0.282||0.702||0.890||0.890|
|[Liu:2016] (Liu et al.)||0.217||1.841||6.986||0.289||0.647||0.882||0.961|
|[Cao:2017] (Cao et al.)||0.115||N/A||4.712||0.198||0.887||0.963||0.982|
|[Kuznietsov:2017] (Kuzni. et al.)||0.113||0.741||4.621||0.189||0.862||0.960||0.986|
|[Xu:2018] (Xu et al.)||0.122||0.897||4.677||N/A||0.818||0.954||0.985|
|[Gurram:2018] (Gurram et al.)||0.100||0.601||4.298||0.174||0.874||0.966||0.989|
|[Kundu:2018AdaDepth] AdaDepth (VK_v1)||0.167||1.257||5.578||0.237||0.771||0.922||0.971|
|[Zheng:2018T2Net] T2Net (VK_v1)||0.174||1.410||6.046||0.253||0.754||0.916||0.966|
|[Garg:2016] (Garg et al.)||0.169||1.512||5.763||0.236||0.836||0.935||0.968|
|[Zhao:2019GASDA] GASDA (VK_v1)||0.120||1.022||5.162||0.215||0.848||0.944||0.974|
|[Pnvr:2020SharinGAN] SharinGAN (VK_v1)||0.116||0.939||5.068||0.203||0.850||0.948||0.978|
In this case, the MDE network is pre-trained on the Cityscapes dataset [Cordts:2016] and then fine-tuned on KITTI.
|11. All vs. LB||0.061||0.559||1.232||0.063||0.103||0.046||0.018|
|13. All vs. UB||0.016||0.138||0.418||0.021||0.026||0.008||0.003|
IV-D3 Ablative analysis of MonoDEVSNet
It is also worth analyzing the contribution of the main components of our proposal. In rows 1-6 of Table IV, we add one component at a time showing performance for absolute depth. The 1st row corresponds to using the real-world data with SfM self-supervision and the virtual-world images with only depth supervision, i.e., using neither semantic supervision, nor gradient equalization, nor domain adaptation, nor mixed mini-batches, nor the global scaling factor. By comparing the 1st and 2nd rows (i.e., without and with the global scaling factor, respectively), we can see how relevant obtaining a good global scaling factor is to output absolute depth. In fact, adding it to the virtual-world depth supervision shows the highest improvement among all the components of our proposal. Then, using mixed mini-batches of real- and virtual-world data improves the performance over alternating mini-batches of only either real- or virtual-world data. This can be seen by comparing the 2nd and 3rd rows (i.e., without and with mixed mini-batches, respectively). If we alternate the domains, the optimization of a mini-batch is dominated by self-supervision (real-world data), and the optimization of the next mini-batch is dominated by supervision (virtual-world data). Thus, there is no actual joint optimization of SfM self-supervised and supervised losses, which turns out to be relevant. Moreover, as can be seen in the 4th row, when we add the DA component we further improve the depth estimation results. As can be seen in the 5th row, adding the equalization between gradients coming from supervision and self-supervision also improves the depth estimation results. Finally, adding the virtual-world mask leads to the best performance in the 6th row. Overall, this analysis shows how all the considered components are relevant in our proposal. We also remark that these components are needed only to train , but only and are required at testing time.
Additionally, we have assessed the effect of simplifying the SfM self-supervised loss that we leverage from [Godard:2019MonoDepth2], here summarized in Sect. III-D. In particular, we use neither the auto-mask nor the multi-scale depth loss, and we replace the minimum re-projection loss with the usual average re-projection loss (i.e., we re-define the corresponding term in Sect. III-D). Results are shown in the 7th row. The metrics show worse values than in the 6th row (All), but still outperform or are on par with PackNet-SfM and the stereo self-supervised methods of Table III.
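The difference between the minimum and the average re-projection loss can be shown on precomputed per-pixel photometric errors. In this sketch the errors are plain arrays with one slice per source frame; how they are obtained (image warping and the photometric error term of Sect. III-D) is outside its scope, and the function name and array layout are assumptions.

```python
import numpy as np


def reprojection_loss(errors, reduce="min"):
    """Combine per-source photometric errors into one scalar loss.

    errors: array of shape (num_sources, H, W), one error map per source
            frame warped into the target view.
    reduce: "min" keeps, per pixel, the lowest error across sources;
            "mean" averages the errors of all sources.
    """
    errors = np.asarray(errors, dtype=np.float64)
    if reduce == "min":
        per_pixel = errors.min(axis=0)
    elif reduce == "mean":
        per_pixel = errors.mean(axis=0)
    else:
        raise ValueError("reduce must be 'min' or 'mean'")
    return float(per_pixel.mean())
```

With the per-pixel minimum, a pixel that is occluded in one source frame but visible in another is scored by the source where it matches well, whereas averaging lets the occluded view inflate the loss.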
In addition to this ablative analysis, we ran further experiments changing the DA mechanism. In particular, instead of taking real- and virtual-world images directly as input to train , a GAN-based CNN, , processes these images to create a common image space in which (hopefully) it is not possible to distinguish the domain. In short, we train an overall CNN , where can come from either the real or the virtual domain, and are the weights of . These weights are jointly trained with all the rest to optimize depth estimation and minimize the possibility of discriminating the original domain of a sample . Table IV shows results using such a GAN when removing (8th row) and when keeping it (9th row). As we can see, this approach does not improve performance. Moreover, it results in a more complex training, and the GAN-based CNN would be required at testing time. Thus, we discarded it.
We also assessed the improvement of our proposal with respect to a lower-bound model (LB) trained on virtual-world images and their depth GT, but using neither real-world data, nor DA, nor the mask. Results are shown in the 10th row of Table IV, and we explicitly show the improvement of our proposal over such an LB in the 11th row. Likewise, we have trained an upper-bound model (UB) replacing VK data by KR data with LiDAR-based supervision, so that DA is not required. Results are shown in the 12th row, and the distance of our model to this UB is explicitly shown in the 13th row. Comparing the 11th and 13th rows, we can see that we are clearly closer to the UB than to the LB.
Finally, we have run experiments using HRNet-W18 and HRNet-W32. The results are shown in the 14th and 15th rows of Table IV, respectively. Indeed, as happens with the results on relative depth (Table I), HRNet-W48 outperforms these more lightweight versions of HRNet. However, by using HRNet-W18 and HRNet-W32 we still outperform or are on par with the state-of-the-art self-supervised methods shown in Table III, i.e., those based on stereo self-supervision and PackNet-SfM.
IV-D4 Qualitative results
Figure 3 presents qualitative results relying on the depth color map commonly used in the MDE literature. We show results for representative methods in Table III, namely, DORN (LiDAR supervision), SharinGAN (stereo self-supervision and virtual-world supervision), PackNet-SfM (SfM self-supervision and ego-vehicle speed supervision), and MonoDEVSNet (Ours) using VK_v1 and VK_v2 (SfM self-supervision and virtual-world supervision). We also show the corresponding LiDAR-based GT. This GT shows that for LiDAR configurations such as the one used to acquire the KITTI dataset, detecting some close vehicles may be problematic since only a few LiDAR points capture their presence. Despite being trained on LiDAR supervision, DORN provides more accurate depth information in these corner cases than the raw LiDAR, which is an example of the relevance of MDE in general. However, DORN shows worse results in these corner cases than the rest (SharinGAN/PackNet-SfM/Ours), even though it is more accurate in terms of MDE metrics, which focus on global assessment. SharinGAN has more difficulties than PackNet-SfM and our proposal in providing sharp borders for vertical objects/infrastructure (e.g., vehicles, pedestrians, traffic signs, trees). Another interesting point is the qualitative difference that we observe in our results depending on the VK version used. In VK_v1 data, vehicle windows appear as transparent to depth, as often happens with LiDAR data, while in VK_v2 they appear as solid. This translates to the MDE results, as we can observe by comparing the two bottom rows of Fig. 3. Technically, we think the qualitative results of VK_v2 make more sense since the windows are there at the given depth. However, what we would like to highlight is that we can select one option or another thanks to the use of virtual-world data.
IV-D5 Additional insights
In terms of qualitative results, we think the best performing and most similar approaches are PackNet-SfM and MonoDEVSNet, both relying only on real-world monocular systems. Thus, we perform a deeper comparison of them. First, following the analysis introduced in the PackNet-SfM article [Guizilini:20203D], Fig. 4 plots the abs-rel metric of both methods as a function of depth. Results are similar up to m, then our proposal clearly outperforms PackNet-SfM up to m, where both methods start to perform similarly and, in the last part of the range, up to m, PackNet-SfM outperforms our proposal. How these differences translate to the abs-rel global metric depends on the number of pixels falling in each distance range, which we show as a histogram in the same plot. We see that for the KR testing set most of the pixels fall in the m depth range, where both methods perform more similarly. Second, we provide further comparative insights by using KS data since it has associated per-class semantic GT, which we use for evaluation purposes. Note that, although KS is a different data split than the one used in the experiments shown so far (KR), it is still KITTI data; thus, we are not yet facing experiments about generalization. Figure 5 compares qualitative results of PackNet-SfM vs. MonoDEVSNet. We can see that PackNet-SfM misses some vehicles that our proposal does not. We believe that these vehicles may be moving at a speed similar to that of the ego-vehicle, which may be problematic for pure SfM-based approaches, and we hypothesize that virtual-world supervision can help to avoid this problem. Figure 6 shows the corresponding abs-rel metric per class, focusing on the most relevant classes for driving. Note how the main differences between PackNet-SfM and MonoDEVSNet are observed on vehicles, especially on cars.
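The kind of analysis behind a plot such as Fig. 4 (abs-rel as a function of depth, together with the number of pixels per depth range) can be sketched as follows. The function name and the bin edges passed by the caller are hypothetical; the depth ranges actually used in the figure are not reproduced here.

```python
import numpy as np


def abs_rel_by_depth(gt, pred, bin_edges):
    """Mean abs-rel and pixel count for each GT depth range.

    Bin i covers [bin_edges[i], bin_edges[i + 1]). Pixels without GT
    (gt <= 0) or outside the covered range are ignored.
    """
    gt = np.asarray(gt, dtype=np.float64).ravel()
    pred = np.asarray(pred, dtype=np.float64).ravel()
    valid = gt > 0
    gt, pred = gt[valid], pred[valid]

    err = np.abs(gt - pred) / gt
    idx = np.digitize(gt, bin_edges) - 1  # -1 or n_bins means out of range
    n_bins = len(bin_edges) - 1

    means = np.full(n_bins, np.nan)  # NaN marks an empty depth range
    counts = np.zeros(n_bins, dtype=int)
    for i in range(n_bins):
        in_bin = idx == i
        counts[i] = int(in_bin.sum())
        if counts[i] > 0:
            means[i] = err[in_bin].mean()
    return means, counts
```

The global abs-rel is the count-weighted average of the per-bin means, which is why the pixel histogram determines how much each depth range contributes to the overall figure. The same grouping idea applies to a per-class analysis by replacing the depth bins with semantic labels.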
Additional qualitative results are added in Fig. 7, where we can see how original images from KR and KS can be rendered as a textured point cloud. In particular, the viewpoint of these renders can change with respect to the original images thanks to the absolute depth values obtained with MonoDEVSNet.
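Rendering an image as a textured point cloud relies on back-projecting each pixel with its absolute depth. A minimal sketch with a pinhole camera model follows; the function name and the intrinsic parameters `fx`, `fy`, `cx`, `cy` are illustrative assumptions, and no claim is made about the rendering tool used for Fig. 7.

```python
import numpy as np


def depth_to_point_cloud(depth, image, fx, fy, cx, cy):
    """Back-project a depth map into 3D points colored by the image.

    depth: (H, W) absolute depth along the optical axis.
    image: (H, W, 3) colors aligned pixel-by-pixel with the depth map.
    Returns (points, colors), both with one row per pixel of positive depth.
    """
    depth = np.asarray(depth, dtype=np.float64)
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row

    # Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    colors = np.asarray(image).reshape(-1, 3)

    keep = points[:, 2] > 0
    return points[keep], colors[keep]
```

Because the points carry metric coordinates, the cloud can be viewed from a camera pose different from the original one; with depth known only up to scale, the geometry would be correct only up to an unknown factor.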
IV-D6 Generalization results
As done in the previous literature using VK to support MDE [Kundu:2018AdaDepth, Zheng:2018T2Net, Zhao:2019GASDA, Pnvr:2020SharinGAN], we assess generalization on the Make3D dataset. As in this literature, we follow the standard data conditioning (cropping and resizing) for models trained on KR, as well as the standard protocol introduced in [Godard:2017] to compute MDE evaluation metrics (e.g., only depth below m is considered). Table V presents the quantitative results usually reported for Make3D, and ours. Note how, in generalization terms, our method also outperforms the rest. Moreover, Fig. 8 shows how our proposal captures the depth structure even better than the depth GT, which is built from depth maps acquired by a 3D scanner.
|Method||abs-rel||sq-rel||rms|
|[Zheng:2018T2Net] T2Net (VK_v1)||0.508||6.589||8.935|
|[Kundu:2018AdaDepth] AdaDepth-S (VK_v1)||0.452||5.71||9.559|
|[Zhao:2019GASDA] GASDA (VK_v1)||0.403||6.709||10.424|
|[Pnvr:2020SharinGAN] SharinGAN (VK_v1)||0.377||4.900||8.388|
For on-board perception, we have addressed monocular depth estimation by virtual-world supervision (MonoDEVS) and real-world SfM-inspired self-supervision, the former compensating for the inherent limitations of the latter. This challenging setting allows us to rely on a monocular system not only at testing time, but also at training time, which is a cheap and scalable approach. We have designed a CNN, MonoDEVSNet, which seamlessly trains on real- and virtual-world data, exploiting semantic and depth supervision from the virtual-world data, and addressing the virtual-to-real domain gap by a relatively simple approach which does not add computational complexity at testing time. We have performed a comprehensive set of experiments assessing quantitative results in terms of relative and absolute depth as well as generalization, and we have shown the relevance of the components involved in MonoDEVSNet training. Our proposal yields state-of-the-art results within the SfM-based setting, even outperforming stereo-based self-supervised approaches. Qualitative results also confirm that MonoDEVSNet properly captures the depth structure of the images. As a result, we show the usefulness of leveraging virtual-world supervision to ultimately reach the upper-bound performance of methods based on LiDAR supervision. Therefore, our next steps will focus on analyzing the detailed differences between LiDAR-based supervision methods and MonoDEVSNet to find better ways to benefit from virtual-world supervision.