Self-Supervised Joint Learning Framework of Depth Estimation via Implicit Cues

by   Jianrong Wang, et al.

In self-supervised monocular depth estimation, depth discontinuity and moving-object artifacts remain challenging problems. Existing self-supervised methods usually train the depth estimation network on a single view. Compared with static views, the abundant dynamic properties between video frames benefit refined depth estimation, especially for dynamic objects. In this work, we propose a novel self-supervised joint learning framework for depth estimation using consecutive frames from monocular and stereo videos. The main idea is an implicit depth cue extractor that leverages dynamic and static cues to generate useful depth proposals. These cues can predict distinguishable motion contours and geometric scene structures. Furthermore, a new high-dimensional attention module is introduced to extract a clear global transformation, which effectively suppresses the uncertainty of local descriptors in high-dimensional space and results in more reliable optimization of the learning framework. Experiments demonstrate that the proposed framework outperforms the state-of-the-art (SOTA) on the KITTI and Make3D datasets.




I Introduction

Depth and ego-motion estimation play essential roles in understanding geometric scenes from videos and images, and have broad applications such as robotics [9] and autonomous driving [6]. Supervised models [21, 12, 27, 43] have obtained depth maps with fine details from color images. However, it is difficult and expensive to collect accurate large-scale labels in practice, and these supervised models are only suitable for specific scenarios.

In recent years, self-supervised methods have attracted increasing interest, and there have been some successes [45, 29, 42, 40, 17, 2]. In the absence of ground truth, one can still recover scene depth and ego-motion from monocular video sequences using self-supervised methods. The key idea is to first warp the source view to the target view through the estimated scene depth and camera ego-motion, and then simultaneously optimize the depth estimation network (DepthNet) and the pose estimation network (PoseNet) by minimizing the view reconstruction loss.

Fig. 1: Outline of the proposed joint learning framework. (a) Overview of the framework: a DepthNet for depth estimation and a PoseNet that takes two stacked input frames to estimate the relative camera pose. The PoseNet encoder is a unit stream extraction network (USEN). (b) and (c) are the detailed structures of the two proposed modules, i.e., IDCE and HAM, respectively. IDCE extracts static and dynamic depth cues by cascading four identical bottlenecks, and HAM extracts global dynamic transformation from the unit stream using convolutions and Gaussian kernel. IDCE connects the USEN and the DepthNet decoder. For training, source view , and IDCE is valid for , while for evaluation, .

However, this framework has the following deficiencies. (1) The DepthNet only uses the static information of the current view and does not effectively utilize the rich dynamic and static depth cues between adjacent views. Hence, the predicted depths of adjacent frames are inconsistent, and the depth changes abruptly between frames. (2) The existing masking methods [42, 32, 47] filter out pixels where there is object motion in the scene, so failures on moving objects are not penalized enough. More precisely, the network reaches a deadlock and cannot find the global optimum, which results in artifacts around moving objects.

The main goal of this work is to alleviate the problems of depth discontinuity and moving-object artifacts mentioned above. We take advantage of the temporal and spatial information changes between consecutive frames, which we call the unit stream in this work. Based on the unit stream and the mainstream framework adopted in [45, 2, 29], which consists of the DepthNet and PoseNet, we propose a novel self-supervised joint learning framework as shown in Fig. 1. This framework introduces two efficient modules to utilize the unit stream. (1) The first module, the implicit depth cues extractor (IDCE), connects the DepthNet and PoseNet. IDCE automatically selects reliable cues to constrain static and dynamic geometric scenes. The unit stream is modeled via statistics of convolutional activations to extract implicit dynamic and static cues and produce powerful depth proposals. The dynamic proposals guide subsequent scene depth estimation and make depth estimation near dynamic objects more accurate, while the static cues encourage the DepthNet to predict smooth depth changes over consecutive snippets. (2) The second module, the high-dimensional attention module (HAM), obtains a more robust camera pose for accurate view reconstruction. It extracts the global dynamic transformation from the unit stream by using convolutions and a Gaussian kernel. This module effectively suppresses the uncertainty of the local descriptor in the high-dimensional space and coordinates the depth network to learn better weights. Note that the proposed framework and modules can be generalized to other existing self-supervised depth estimation methods.

To summarize, our main contributions are three-fold:

  • To the best of our knowledge, this is the first work that proposes a module, called IDCE, to connect the DepthNet and PoseNet in order to extract implicit static and dynamic depth cues from the shallow space of the unit stream.

  • The novel HAM captures global pose transformation from the unit stream, making the joint learning framework optimization more efficient. Besides, it can be used as a post-processing method for other pose estimation networks.

  • The joint learning framework is extensively evaluated on the KITTI [15] and Make3D [34] datasets. Experimental results show that the proposed framework achieves SOTA performance and outperforms most recent algorithms by a significant margin.

II Related Work

Depth estimation has been studied for a long time. In this section, we mainly discuss the related works based on deep learning from two perspectives: supervised and self-supervised depth estimation.

II-A Supervised depth estimation

In recent years, deep learning has made a breakthrough in depth estimation. Supervised depth estimation [27] seeks a mapping from color images to depth maps. Eigen et al. [11] first employed a multi-scale convolutional neural network, which refined the estimated depth map from low spatial resolution to high spatial resolution. To overcome the low-resolution problem, Laina et al. [25] employed an up-sampling method for learning. Fu et al. [12] introduced a spacing-increasing discretization strategy to discretize depth, and then adopted a multi-scale dilated convolution to capture multi-scale information in parallel.

Although these supervised methods have achieved excellent performance, they need ground truth labels collected by expensive LIDAR [15] or RGBD cameras [35], which restricts the usage scenarios or depth ranges.

II-B Self-supervised depth estimation

Without requiring the ground truth labels, self-supervised methods use photometric constraints from multiple views, e.g., multiple views captured by a monocular camera or stereo [7, 18, 26, 29, 45]. The following discussions mainly focus on these two aspects.

II-B1 Stereo depth estimation

Garg et al. [14] leveraged the epipolar geometry [19] inherent in stereo views to train the monocular DepthNet, where the photometric consistency loss between stereo pairs is used as the supervision signal. Godard et al. [16] proposed a left-right consistency constraint between left and right disparity maps. In these methods, accurately rectified stereo cameras provide explicit pose supervision for self-supervised depth estimation.

II-B2 Monocular depth estimation

SfMLearner [45] was the first method to learn both depth and ego-motion using the geometric constraints of monocular video. Meanwhile, additional masks ignored moving objects that violated the rigid scene assumption. Following this framework, several approaches [42, 32, 17, 2, 47] have been proposed to address the challenge of moving objects. Although they show significant improvements in performance, they still handle dynamic scenes poorly in a monocular setting. These methods pay little attention to moving areas or discard them directly, and thus the network reaches a deadlock and cannot find the global minimum in areas with motion. As a result, the DepthNet cannot predict distinguishable motion contours and geometric scene structure. Casser et al. [4] proposed a novel approach that modelled moving objects and produced higher quality results. Besides, it was proposed in [1, 3] that synthetic data can be used to collect diverse training data. Yang et al. [39] combined normal and edge geometry to achieve better performance. Very recently, Patil et al. [31] exploited a recurrent neural network (RNN) to generate a time series of depth maps. Although they use spatio-temporal information, the complex network structure creates huge computational costs during training.

Unlike the works mentioned above, based on the general framework, we propose a new self-supervised joint learning framework that connects the DepthNet and PoseNet to extract implicit static and dynamic depth cues from the shallow space of the unit stream. Moreover, the global dynamic transformation from the unit stream is also exploited.

III Method

In this section, we introduce the proposed joint learning framework, which takes adjacent video frames as input and produces a depth map as output. Details of the two proposed modules, IDCE and HAM, will be described. Before that, we first review the key ideas of the commonly used baseline in self-supervised depth estimation.

III-A Algorithm baseline

The baseline consists of two networks, i.e., the DepthNet and the PoseNet. The former one aims to estimate the dense depth map of the target view, and the latter aims to estimate the relative camera pose between nearby views for monocular and mixed (i.e., monocular and stereo) training. In the absence of ground truth, the DepthNet and PoseNet can be solely optimized using the view reconstruction loss between the original target view and the synthesized target view.

According to [45], the view $I_t$ can be synthesized from $I_s$ as:

$$p_s \sim K \, T_{t \to s} \, D_t(p_t) \, K^{-1} p_t \qquad (1)$$

where $D_t$ is the predicted depth of target view $I_t$, $T_{t \to s}$ is the relative camera pose of the source view $I_s$ with respect to the target view $I_t$, and $p_s$ and $p_t$ are the homogeneous coordinates of a pixel in $I_s$ and $I_t$, respectively. $K$ is the camera intrinsic matrix. During training of the self-supervised model with stereo views, $D_t$ is the only unknown variable. However, for monocular training, the source view $I_s$ is one of the temporally adjacent frames, and thus the relative camera pose $T_{t \to s}$ also needs to be predicted. For mixed training, the source view is one of the temporally adjacent frames or the opposite stereo view.
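The warping step can be illustrated for a single pixel. The following is a minimal sketch, assuming a pinhole camera with zero skew and a 4x4 rigid transform from the target to the source camera frame; the helper names (`matvec`, `invert_intrinsics`, `project_to_source`) are illustrative and not from the paper.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]


def invert_intrinsics(K):
    """Closed-form inverse of a zero-skew pinhole intrinsic matrix (assumption)."""
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    return [[1.0 / fx, 0.0, -cx / fx],
            [0.0, 1.0 / fy, -cy / fy],
            [0.0, 0.0, 1.0]]


def project_to_source(p_t, depth, K, T_t_to_s):
    """Map target pixel p_t = (u, v) with known depth into the source view.

    Steps: back-project with K^-1, scale by depth, move the 3D point with
    the relative pose, then project with K and divide by the third coordinate.
    """
    ray = matvec(invert_intrinsics(K), [p_t[0], p_t[1], 1.0])
    point = [depth * r for r in ray] + [1.0]      # homogeneous 3D point
    cam_s = matvec(T_t_to_s, point)[:3]           # point in source camera frame
    proj = matvec(K, cam_s)
    return proj[0] / proj[2], proj[1] / proj[2]
```

With an identity pose the pixel maps to itself; a sideways camera translation shifts the pixel by an amount inversely proportional to depth, which is the parallax cue the reconstruction loss exploits.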

Concerning the loss function, following [17], a common total loss is composed of a photometric loss and a smoothness loss:

$$L = \mu L_p + \lambda L_s \qquad (2)$$

where $\mu$ is the auto-masking term, $L_p$ and $L_s$ are the photometric and smoothness losses evaluated at scale $s$, and $\lambda$ is a hyperparameter, which is set to be 0.001. The average loss at multiple scales is taken as the final loss.

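In equation (2), the photometric loss shown in equation (4) is a combination of the structural similarity (SSIM) [46] and L1 loss for multiple reconstructed views:

$$L_p(p) = \min_{s} \, pe\big(I_t(p), I_{s \to t}(p)\big) \qquad (4)$$

where $p$ is the index of pixel coordinates, $I_{s \to t}$ is the target view reconstructed from source view $I_s$, and $pe$ denotes:

$$pe(I_a, I_b) = \frac{\alpha}{2}\big(1 - \mathrm{SSIM}(I_a, I_b)\big) + (1 - \alpha)\,\lVert I_a - I_b \rVert_1 \qquad (5)$$

with the hyper-parameter $\alpha$ set to be 0.85. Here, the per-pixel minimum reprojection loss is adopted to calculate the minimum photometric error at various scales over all source views.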
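The per-pixel minimum over source views can be sketched as follows. This is a minimal illustration that assumes the photometric error form used in Monodepth2 [17], a weighted sum of an SSIM term and an L1 term, with precomputed per-pixel SSIM and L1 values; the function names are hypothetical.

```python
def photometric_error(ssim, l1, alpha=0.85):
    """Combine a per-pixel SSIM score and L1 difference (assumed form from [17])."""
    return alpha / 2.0 * (1.0 - ssim) + (1.0 - alpha) * l1


def min_reprojection(errors_per_source):
    """Keep, for every pixel, the smallest error across all source views.

    errors_per_source: one flat list of per-pixel errors per source view.
    """
    return [min(pixel_errors) for pixel_errors in zip(*errors_per_source)]
```

Taking the minimum rather than the average means a pixel that is occluded in one source view is scored by the view in which it is visible.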

Besides, in equation (2), the edge-aware depth smoothness loss in [16] is also employed:

$$L_s = \sum_{p} \left( |\partial_x d^*(p)| \, e^{-|\partial_x I_t(p)|} + |\partial_y d^*(p)| \, e^{-|\partial_y I_t(p)|} \right) \qquad (6)$$

where $d^*$ is the mean-normalized inverse depth map, $p$ denotes the pixel index of $I_t$, and $\partial_x$ and $\partial_y$ denote gradients in the $x$ and $y$ directions, respectively. Applying such regularization encourages the DepthNet to produce sharp depth edges at sharply varying pixels while producing smooth depth in continuous regions.
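The effect of the edge-aware weighting can be seen on a tiny example. The sketch below assumes the common formulation from [16, 17]: disparity gradients are down-weighted by the exponential of the negative image gradient, and the x and y terms are averaged separately. It operates on small nested lists and a grayscale image for readability; `edge_aware_smoothness` is an illustrative name.

```python
import math


def edge_aware_smoothness(disp, image):
    """Edge-aware smoothness on 2D lists (H x W, both at least 2 x 2)."""
    h, w = len(disp), len(disp[0])
    mean = sum(sum(row) for row in disp) / (h * w)
    # Mean-normalize the disparity so the loss does not favor shrinking it.
    norm = [[d / (mean + 1e-7) for d in row] for row in disp]
    grad_x = [abs(norm[y][x] - norm[y][x + 1])
              * math.exp(-abs(image[y][x] - image[y][x + 1]))
              for y in range(h) for x in range(w - 1)]
    grad_y = [abs(norm[y][x] - norm[y + 1][x])
              * math.exp(-abs(image[y][x] - image[y + 1][x]))
              for y in range(h - 1) for x in range(w)]
    return sum(grad_x) / len(grad_x) + sum(grad_y) / len(grad_y)
```

A disparity step that coincides with a strong image edge is penalized far less than the same step inside a textureless region.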

| Methods | Dataset | Abs Rel | Sq Rel | RMSE | RMSE log | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|---|
| Eigen et al. [11] | K (D) | 0.203 | 1.548 | 6.307 | 0.282 | 0.702 | 0.890 | 0.958 |
| Liu et al. [27] | K (D) | 0.202 | 1.614 | 6.523 | 0.275 | 0.678 | 0.895 | 0.965 |
| Garg et al. [14] | K (S) | 0.152 | 1.226 | 5.849 | 0.246 | 0.784 | 0.921 | 0.967 |
| Godard et al. [16] | CS+K (S) | 0.124 | 1.076 | 5.311 | 0.219 | 0.847 | 0.942 | 0.973 |
| Yang et al. [40] | K+CS (S) | 0.114 | 1.074 | 5.836 | 0.208 | 0.856 | 0.939 | 0.976 |
| Guo et al. [18] | K (DS) | 0.096 | 0.641 | 4.095 | 0.168 | 0.892 | 0.967 | 0.986 |
| DORN [12] | K (D) | 0.072 | 0.307 | 2.727 | 0.120 | 0.932 | 0.984 | 0.994 |
| Zhou et al. [45] | K (M) | 0.208 | 1.768 | 6.856 | 0.283 | 0.678 | 0.885 | 0.957 |
| Yang et al. [41] | K (M) | 0.182 | 1.481 | 6.501 | 0.267 | 0.725 | 0.906 | 0.963 |
| Mahjourian et al. [29] | K (M) | 0.163 | 1.240 | 6.220 | 0.250 | 0.762 | 0.916 | 0.968 |
| Wang et al. [37] | K (M) | 0.151 | 1.257 | 5.583 | 0.228 | 0.810 | 0.936 | 0.974 |
| GeoNet [42] | K (M) | 0.149 | 1.060 | 5.567 | 0.226 | 0.796 | 0.935 | 0.975 |
| DF-Net [47] | K (M) | 0.150 | 1.124 | 5.507 | 0.223 | 0.806 | 0.933 | 0.973 |
| Ranjan et al. [32] | K (M) | 0.140 | 1.070 | 5.326 | 0.217 | 0.826 | 0.941 | 0.975 |
| Struct2depth [4] | K (M) | 0.141 | 1.026 | 5.291 | 0.215 | 0.816 | 0.945 | 0.979 |
| SynDeMo [3] | K+vK (MD) | 0.116 | 0.746 | 4.627 | 0.194 | 0.858 | 0.952 | 0.977 |
| GLNet [7] | K (M) | 0.135 | 1.070 | 5.230 | 0.210 | 0.841 | 0.948 | 0.980 |
| Zhou et al. [44] (384×1248) | K (M) | 0.121 | 0.837 | 4.945 | 0.197 | 0.853 | 0.955 | 0.982 |
| Monodepth2 [17] | K (M) | 0.115 | 0.903 | 4.863 | 0.193 | 0.877 | 0.959 | 0.981 |
| Bian et al. [2] | K (M) | 0.137 | 1.089 | 5.439 | 0.217 | 0.830 | 0.942 | 0.975 |
| Patil et al. [31] | K (M) | 0.111 | 0.821 | 4.650 | 0.187 | 0.883 | 0.961 | 0.982 |
| Ours (192×640) | K (M) | 0.106 | 0.799 | 4.662 | 0.187 | 0.889 | 0.961 | 0.982 |
| Ours (320×1024) | K (M) | 0.106 | 0.773 | 4.491 | 0.185 | 0.890 | 0.962 | 0.982 |
| Monodepth2 [17] (192×640) | K (MS) | 0.106 | 0.818 | 4.750 | 0.196 | 0.874 | 0.957 | 0.979 |
| Watson et al. [38] (192×640) | K (MSD) | 0.106 | 0.780 | 4.695 | 0.193 | 0.875 | 0.958 | 0.980 |
| Ours (192×640) | K (MS) | 0.102 | 0.776 | 4.534 | 0.183 | 0.893 | 0.963 | 0.982 |
| Monodepth2 [17] (320×1024) | K (MS) | 0.106 | 0.806 | 4.630 | 0.193 | 0.876 | 0.958 | 0.980 |
| Watson et al. [38] (320×1024) | K (MSD) | 0.100 | 0.728 | 4.469 | 0.185 | 0.885 | 0.962 | 0.982 |
| Ours (320×1024) | K (MS) | 0.101 | 0.725 | 4.360 | 0.179 | 0.898 | 0.965 | 0.983 |
| GLNet [7] | CS+K (M) | 0.129 | 1.044 | 5.361 | 0.212 | 0.843 | 0.938 | 0.976 |
| Bian et al. [2] | CS+K (M) | 0.128 | 1.047 | 5.234 | 0.208 | 0.846 | 0.947 | 0.976 |
| Struct2depth [4] | CS+K (M) | 0.108 | 0.825 | 4.750 | 0.186 | 0.873 | 0.957 | 0.982 |
| SynDeMo [3] | CS+K+vK (MD) | 0.112 | 0.740 | 4.619 | 0.187 | 0.863 | 0.958 | 0.983 |
| Ours (192×640) | CS+K (M) | 0.106 | 0.774 | 4.623 | 0.184 | 0.886 | 0.962 | 0.983 |
TABLE I: Evaluation results of depth estimation on the KITTI test set [11]. Methods trained on the KITTI raw dataset [15] are denoted by K, those using the virtual KITTI dataset by vK, and models pre-trained on CityScapes [8] by CS+K. M, S and D denote monocular video, stereo supervision and auxiliary depth supervision, respectively. D means depth supervision. The best results in each category are in bold, and the second best are underlined.
Fig. 2: Qualitative results on the KITTI Eigen split. From top to bottom, the images are input, ground truth, results of Godard et al. [16], Zhou et al. [45], Wang et al. [37], Godard et al. [17], Bian et al. [2] and our KITTI monocular method, respectively. Our method effectively solves the motion blur and artifact, provides a clearer motion contour, and offers sharper predictions of static objects.

III-B Self-supervised joint learning framework

III-B1 Motivation

The mainstream framework only considers the camera pose information in the unit stream and ignores the role of depth cues between adjacent frames. Motivated by this, we consider both implicit depth cues and pose information to be important attributes of the unit stream, each of which can act on the appropriate network. The specific scenarios of the unit stream are as follows: (1) static scenes, (2) moving objects, (3) a moving camera relative to static scenes, (4) a stationary camera relative to moving objects, and (5) a moving camera relative to a moving object. On the one hand, all scenarios provide depth cues as a dynamic supplement to the depth information of a single frame. On the other hand, only scenarios (1) and (3) provide camera pose information, which is an essential link in the process of view reconstruction, while (2), (4) and (5) are inappropriate since the moving objects in them violate the static scene assumption underlying view reconstruction. Due to the lack of a proper supervision signal, the depth estimation network reaches a deadlock in the area of dynamic objects. In addition, a single view cannot provide the dynamic properties of moving objects. PoseNet essentially learns motion information between frames. To make full use of inter-frame motion information, we first use the PoseNet encoder as the unit stream extraction network, whose output is the unit stream, to model the shallow space between frames, and then extract dynamic and static depth cues from the complex and diverse implicit information in that space.

To better extract the implicit cues and estimate the depth in all five scenarios above, we propose two modules, IDCE and HAM, based on the mainstream framework. The depth cues extracted by IDCE can be used as a dynamic supplement. The cues are modeled via statistics of convolutional activations and are added element-wise to the feature of the target frame, thereby increasing the proportion of moving-object features. They can guide subsequent scene depth estimation and make depth estimation near dynamic objects more accurate, while static proposals encourage the DepthNet to predict smooth depth changes over consecutive snippets. HAM can effectively reduce the noise caused by moving objects in scenarios (2), (4) and (5). The detailed architecture of our proposed framework is illustrated in Fig. 1.

III-B2 Implicit depth cues extractor

As shown in Fig. 1 (b), the IDCE is an intermediate transition layer that links the stream encoding network and the DepthNet. It is designed to transfer implicit depth cues from the unit stream to the DepthNet. We adapt the bottleneck block [20] and use it as the basic block. Empirically, we cascade four identical bottlenecks as the final depth cues extractor. Each bottleneck contains three layers, which are 1×1, 3×3, and 1×1 convolutions. The 1×1 layers are responsible for reducing the channel number to 1/4 and then restoring dimensions. All layers use a stride of 1. Since the input and output have the same dimensions, identity shortcuts are used directly.
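The cost of the bottleneck design can be estimated with a short calculation. The sketch below assumes the standard ResNet bottleneck kernel sizes (1×1, 3×3, 1×1), a hypothetical width of 512 channels (the final-stage width of ResNet18), and ignores biases and batch normalization parameters; the function names are illustrative.

```python
def bottleneck_params(channels, reduction=4):
    """Count convolution weights in one bottleneck block.

    Assumed layout: 1x1 conv reduces channels to channels // reduction,
    a 3x3 conv works at the reduced width, and a 1x1 conv restores the width.
    """
    mid = channels // reduction
    reduce_weights = channels * mid          # 1x1 reduce
    spatial_weights = mid * mid * 3 * 3      # 3x3 at reduced width
    restore_weights = mid * channels         # 1x1 restore
    return reduce_weights + spatial_weights + restore_weights


def idce_params(channels, num_blocks=4):
    """Four identical bottlenecks cascaded, as described in the text."""
    return num_blocks * bottleneck_params(channels)
```

Under these assumptions, four cascaded bottlenecks at 512 channels hold fewer weights than a single plain 3×3 convolution at full width (9 × 512 × 512), which is the usual motivation for the reduce-then-restore layout.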

III-B3 High-dimensional attention module

Attention can bias the allocation of available resources towards the most valuable parts of an input signal. Recently, the combination of spatial and channel attention module (CAM) has been successfully applied to a variety of vision tasks [22, 5, 13, 44]. Nonetheless, CAM cannot effectively reduce noise and is insufficiently rich to capture the high dimensional geometric characteristics of multiple views.

Inspired by [5], we extend features to high dimensions using the Gaussian kernel and propose the HAM illustrated in Fig. 1 (c). Given a local feature of size $C \times H \times W$, where $C$, $H$ and $W$ denote the channel, height and width dimensions, we first feed it into three independent convolution layers to generate three new features K, Q and V, respectively. After that, we apply a Gaussian kernel function between K and Q to measure the similarity between each pair of feature points in K and Q. Uncertainties in the unit stream introduced by moving objects, occlusions, and incomplete Lambertian surfaces are controlled by this similarity. Besides, it can search for the global transformation across multiple views. Then we perform a matrix multiplication between the similarity and V. Finally, we multiply the result by a scale parameter and perform an element-wise sum with the input feature to obtain the global camera pose as follows:



Here, each convolution layer is followed by an activation function. A hyperparameter of the module is experimentally set to be 0.5, and the scale parameter is initialized as 0 and gradually updated as the model learns.

Equation (8) represents the approximate relationship between two tensors. Each element in Equation (8) can be expanded into an $n$th-order polynomial by Taylor’s expansion:




In equation (10), the expansion terms are capable of modeling and using the high-order statistics of the local descriptors. Thus, we can directly obtain the high-order attention ‘map’ through equation (8), whose values lie in the interval [0,1]. In equation (11), each term represents the component of the local descriptor in an n-dimensional space. Compared with a method that computes the attention ‘map’ directly from the features, equation (7) comprehensively considers multi-dimensional similarity. When two tensors have similar components in each feature space, the tensors are globally similar. At the same time, this effectively suppresses the uncertainty of the local descriptor in the high-dimensional space.

HAM can extract global dynamic transformation from the unit stream. The inter-frame features of the original space are mapped to the high-dimensional feature space, which captures more complex and high-order relationships, and matches the global spatial correlation of the original view.
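The attention flow described in this subsection (similarity between K and Q, aggregation of V, scaled residual sum) can be sketched on small lists of descriptors. The exact kernel form is an assumption: the sketch uses exp(-||k - q||² / (2σ²)), treats the 0.5 hyperparameter as σ, and applies no normalization of the weights. `gaussian_similarity`, `gaussian_attention` and `gamma` are illustrative names, and the convolutions that would produce K, Q and V are omitted.

```python
import math


def gaussian_similarity(k, q, sigma=0.5):
    """Gaussian kernel between two descriptors; the result lies in (0, 1]."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(k, q))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gaussian_attention(K, Q, V, X, gamma=0.0, sigma=0.5):
    """Aggregate V with Gaussian similarities and add the scaled result to X.

    K, Q, V, X: lists of N descriptors, each a list of C floats.
    gamma: learnable scale in the text; 0 reproduces the input exactly.
    """
    out = []
    for i in range(len(Q)):
        weights = [gaussian_similarity(K[j], Q[i], sigma) for j in range(len(K))]
        agg = [sum(w * V[j][c] for j, w in enumerate(weights))
               for c in range(len(V[0]))]
        out.append([gamma * a + x for a, x in zip(agg, X[i])])
    return out
```

Because the scale starts at 0, the module initially passes its input through unchanged and only gradually mixes in the globally aggregated features as training proceeds.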

III-B4 Network architecture

By integrating the two modules above into the mainstream framework, we establish a new self-supervised depth estimation framework (see Fig. 1). We rely on the successful architecture in [17] as our basic framework. Both the DepthNet encoder (DE) and the PoseNet encoder (PE) use the same architecture (ResNet18 [20]) except for the first layer. The number of input channels of the first convolution of the PE is changed from 3 to 6, which allows the adjacent frames to be fed into the network together. The PE is considered as a unit stream extraction network (USEN), and IDCE is used to connect the USEN and the DepthNet decoder (DD). The input size of IDCE is the same as the output of the USEN, and the output size is consistent with the input of the DD. We perform an element-wise sum of the last layers of IDCE and DE, then feed the result into the DD. The DepthNet adopts a multi-scale architecture and predicts disparity maps at 1, 1/2, 1/4 and 1/8 of the color image resolution. HAM is applied after the PoseNet decoder to obtain the final global 6D ego-motion. For the proposed modules, we adopt batch normalization right after each convolution and before ReLU activation.

| Methods | Train | Abs Rel | Sq Rel | RMSE | RMSE log | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|---|
| Zhou et al. [45] | M | 0.176 | 1.532 | 6.129 | 0.244 | 0.758 | 0.921 | 0.971 |
| GeoNet [42] | M | 0.132 | 0.994 | 5.240 | 0.193 | 0.883 | 0.953 | 0.985 |
| Mahjourian et al. [29] | M | 0.134 | 0.983 | 5.501 | 0.203 | 0.827 | 0.944 | 0.981 |
| EPC++ [28] | M | 0.120 | 0.789 | 4.755 | 0.177 | 0.856 | 0.961 | 0.987 |
| Monodepth2 [17] (192×640) | M | 0.090 | 0.545 | 3.942 | 0.137 | 0.914 | 0.983 | 0.995 |
| Ours (192×640) | M | 0.082 | 0.462 | 3.739 | 0.127 | 0.923 | 0.984 | 0.996 |
| EPC++ [28] | MS | 0.123 | 0.754 | 4.453 | 0.172 | 0.863 | 0.964 | 0.989 |
| Monodepth2 [17] (192×640) | MS | 0.080 | 0.466 | 3.681 | 0.127 | 0.926 | 0.985 | 0.995 |
| Ours (192×640) | MS | 0.077 | 0.431 | 3.598 | 0.121 | 0.931 | 0.986 | 0.996 |
TABLE II: Evaluation results of depth estimation on the KITTI improved ground truth [36].

IV Experiments

IV-A Dataset

Our experiments are mainly conducted on the KITTI [15], CityScapes [8] and Make3D [34] datasets. The KITTI dataset includes a full suite of raw data such as stereo videos and 3D point clouds. We use 39810 monocular frames and stereo pairs for training, about 4K images for evaluation, and 697 images from the test split [11]. The CityScapes dataset contains various stereo video sequences recorded in 50 different cities. We choose the monocular sequences of 8-bit images taken by the left camera. We additionally evaluate our model trained on KITTI on the Make3D dataset, which is unseen during training, to assess the generalization ability. Also, we pre-train the network on CityScapes and fine-tune it on KITTI.

As for the experimental metrics, following Zhou et al. [45], we use the following metrics to evaluate our depth estimation method on the KITTI test split and the Make3D dataset: (1) Abs Rel, Sq Rel, RMSE and RMSE log (lower is better), and (2) δ < 1.25, δ < 1.25², δ < 1.25³ (higher is better).
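For reference, these metrics can be computed as in the sketch below, which uses the standard definitions from the depth estimation literature on flat lists of valid, positive depth values. The function name `depth_metrics` and the dictionary keys are illustrative.

```python
import math


def depth_metrics(gt, pred):
    """Standard depth error and accuracy metrics over paired positive depths."""
    n = len(gt)
    pairs = list(zip(gt, pred))
    abs_rel = sum(abs(g - p) / g for g, p in pairs) / n
    sq_rel = sum((g - p) ** 2 / g for g, p in pairs) / n
    rmse = math.sqrt(sum((g - p) ** 2 for g, p in pairs) / n)
    rmse_log = math.sqrt(
        sum((math.log(g) - math.log(p)) ** 2 for g, p in pairs) / n)
    # Accuracy: fraction of pixels whose ratio max(g/p, p/g) is under 1.25^k.
    ratios = [max(g / p, p / g) for g, p in pairs]
    deltas = [sum(1 for r in ratios if r < 1.25 ** k) / n for k in (1, 2, 3)]
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse,
            "rmse_log": rmse_log, "d1": deltas[0], "d2": deltas[1],
            "d3": deltas[2]}
```

The four error metrics are zero for a perfect prediction, while the three accuracy metrics equal one.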

Median scaling [45] is used to align the predictions with the ground truth during evaluation. Note that we remove the sequences where the camera does not move between frames during training. During evaluation, two adjacent frames are fed to the USEN and DE. For discrete samples, such as the first frame of a video, we duplicate each sample to simulate adjacent frames.
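Median scaling resolves the unknown global scale of monocular predictions. A minimal sketch, assuming flat lists of valid depth values for one image; `median_scale` is an illustrative name.

```python
import statistics


def median_scale(pred, gt):
    """Rescale predicted depths so their median matches the ground-truth median."""
    scale = statistics.median(gt) / statistics.median(pred)
    return [p * scale for p in pred], scale
```

For example, a prediction of `[1.0, 2.0, 3.0]` against a ground truth of `[2.0, 4.0, 6.0]` gives a scale of 2.0, after which the two match exactly.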

IV-B Implementation details

Our model is implemented with the PyTorch framework and trained on a single Tesla V100 for 20 epochs with a batch size of 8. Additionally, random contrast, brightness, saturation, color jittering, horizontal flip and random resizing are used during training. The default input and output resolution is 192×640. For comparison, we also use a larger resolution of 320×1024 in experiments.

Similar to [17, 2], the DE and USEN are initialized by a ResNet-18 backbone pretrained on the ImageNet dataset [33]. USEN uses the pre-training weights and removes the weights of the first layer. We adopt the Adam [23] optimizer with an initial learning rate of -, and reduce it to 10% after 15 epochs. $\beta_1$, $\beta_2$ and weight decay are set to be 0.9, 0.999 and 0.0001, respectively. To alleviate the difficulty of directly optimizing the IDCE and HAM, an effective training strategy is explored to decouple the disparity images from the transformation. More precisely, we first train the baseline and HAM, then jointly train the entire model. It turns out that this strategy leads to superior performance on multiple datasets.
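The step schedule described above can be written as a small function. This is a sketch only: the initial learning rate is not recoverable from the text, so `base_lr` is left as a caller-supplied parameter, and epochs are assumed to be counted from 0.

```python
def learning_rate(epoch, base_lr, drop_epoch=15, factor=0.1):
    """Return base_lr for the first drop_epoch epochs, then 10% of it."""
    return base_lr * factor if epoch >= drop_epoch else base_lr


# Relative schedule over the 20 training epochs, using a unit base rate
# purely as a placeholder to show the shape of the schedule.
schedule = [learning_rate(epoch, base_lr=1.0) for epoch in range(20)]
```

With a unit base rate, the list holds fifteen entries of 1.0 followed by five entries of 0.1.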

IV-C Comparisons with the SOTA

In this subsection, our method is evaluated both qualitatively and quantitatively on the KITTI and Make3D datasets, and odometry results are further evaluated on the KITTI odometry dataset. Results show that our proposed framework achieves SOTA performance and outperforms recent algorithms on the depth estimation task.

Fig. 3: More qualitative results on KITTI test splits.

IV-C1 Results on KITTI dataset

We compare the performance of the proposed framework with the baseline as well as existing SOTA methods in Table I. Results show that our method achieves significant gains over existing SOTA self-supervised approaches when trained with different types of data: KITTI monocular frames only, KITTI monocular frames and stereo pairs, and CityScapes plus KITTI monocular frames.

We summarize the main results in Table I as follows: (1) Overall, our method outperforms the previous SOTA under the same training setting. Although trained in a self-supervised manner, our method competes quite favorably with most supervised baselines. (2) Our KITTI monocular and CityScapes+KITTI models are slightly worse than [3] on the Sq Rel and RMSE metrics, and the high-resolution monocular-stereo model obtains the second-best Abs Rel of 0.101, only 0.001 behind the result in [38]. However, it should be mentioned that [3, 38] use an additional auxiliary supervision signal, while we only use CityScapes and KITTI raw data. (3) For KITTI monocular training, our method is slightly better than [31], which is trained by a ConvLSTM-based network with video inputs. In comparison, we improve the mainstream framework, and our method is much simpler and more efficient. (4) It is worth mentioning that our method outperforms recent works [7, 32, 42, 47] that jointly learn multiple tasks with complex network structures. (5) Moreover, experimental results show that the stereo view, CityScapes pre-training, and high-resolution images can improve the performance of the monocular depth estimation model.

As shown in Table II, we directly compare the proposed method with existing methods on the KITTI improved ground truth from [36]. The improved ground truth covers 652 of the 697 test frames contained in [10]. The predicted depth maps are clipped to 80 meters, and then the full maps are evaluated. The values of the existing methods are reported by [17]. Our method is still significantly better than existing published methods without retraining.

Qualitative results are shown in Fig. 2, where some comparison samples between our KITTI monocular method and some self-supervised baselines are presented. As shown in the first image, compared with other methods, our method provides a clearer motion contour. It also perceives the geometry of static objects and yields a more reasonable depth estimation. Moreover, the depth difference between static overlapping objects can be clearly distinguished. To comprehensively visualize the performance of the proposed method, more qualitative results in different cases on the KITTI dataset are shown in Fig. 3.

| Method | Train | Abs Rel | Sq Rel | RMSE |
|---|---|---|---|---|
| Zhou et al. [45] | M | 0.383 | 5.321 | 10.470 |
| DDVO [24] | M | 0.387 | 4.720 | 8.090 |
| Monodepth2 [17] | M | 0.322 | 3.589 | 7.417 |
| SynDeMo [3] | M | 0.330 | 2.692 | 6.850 |
| Bian et al. [2] | M | 0.312 | | |
| Zhou et al. [44] | M | 0.318 | 2.288 | 6.669 |
| Ours | M | 0.306 | 2.056 | 6.721 |
TABLE III: Evaluation of depth estimation results on the Make3D test set[34]. The best results in each category are in bold, and the second best are underlined. Full denotes the proposed self-supervised joint learning framework.

IV-C2 Results on Make3D dataset

In Table III, we directly evaluate our method’s performance on the Make3D dataset without training on any of its data. Our model is trained on KITTI monocular video without any fine-tuning. Following the evaluation protocol in [45], only the central image regions where depth is less than 70 meters are evaluated. Our result outperforms existing SOTA methods that do not use depth supervision, showing excellent cross-dataset generalization ability.

Fig. 4: Qualitative results on Make3D dataset. From top to bottom, the images are input, ground truth, and results of our CS+K monocular method, respectively.

Indeed, our method cannot be directly applied to the Make3D dataset because the data is discrete and does not have the inherent dynamic properties of video sequences. To adapt our method to this dataset, we resize the test images to 192×640 resolution and replicate each sample to simulate adjacent frames for evaluation. The results demonstrate that the static depth cues present in the shallow space are helpful for depth estimation.

Fig. 4 shows some qualitative results of the Make3D dataset, which are estimated by our CS+K model. Both quantitative and qualitative experiments demonstrate the generalization ability of our method in estimating accurate depth maps from consecutive frames.

Method Sequence 09 Sequence 10 #frames
ORB-SLAM (full) 0.014 ± 0.008 0.012 ± 0.011 -
Zhou et al. [45] 0.021 ± 0.017 0.020 ± 0.015 5
SynDeMo [3] 0.011 ± 0.007 0.011 ± 0.015 5
Monodepth2 [17] 0.017 ± 0.008 0.015 ± 0.010 2
Zhou et al. [44] 0.015 ± 0.007 0.015 ± 0.009 3
GLNet [7] 0.011 ± 0.006 0.011 ± 0.009 3
Ours 0.016 ± 0.008 0.014 ± 0.009 2
TABLE IV: Evaluation results on the KITTI odometry dataset. ‘#frames’ is the number of input frames.

IV-C3 Results on KITTI odometry dataset

For completeness, we evaluate the two-frame model on five-frame test sequences, combining the four frame-to-frame transformations in each group to form a local trajectory. We measure the absolute trajectory error averaged over all 5-frame snippets on sequences 09 and 10. The pose estimation results are summarized in Table IV. Although our method does not exceed SOTA, it still achieves satisfactory performance. The main advantage of our method is reflected in the depth estimation task.
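The snippet-level evaluation can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: `chain_poses` and `snippet_ate` are hypothetical names, each relative pose is assumed to be a 4×4 matrix giving the pose of frame i+1 expressed in frame i, and the alignment (shared first frame plus a least-squares scale) and the error normalisation are one common choice for scale-ambiguous monocular trajectories.

```python
import numpy as np

def chain_poses(relative_poses):
    """Compose frame-to-frame 4x4 transforms into camera positions.

    Returns an (N + 1, 3) array; the first camera sits at the origin.
    """
    pose = np.eye(4)
    positions = [pose[:3, 3].copy()]
    for rel in relative_poses:
        pose = pose @ np.asarray(rel, dtype=np.float64)
        positions.append(pose[:3, 3].copy())
    return np.stack(positions)

def snippet_ate(pred_xyz, gt_xyz):
    """Trajectory error of one snippet after origin and scale alignment."""
    pred = pred_xyz - pred_xyz[0]          # align first frames
    gt = gt_xyz - gt_xyz[0]
    denom = np.sum(pred ** 2)
    # Least-squares scale; fall back to 1 for a motionless prediction.
    scale = np.sum(gt * pred) / denom if denom > 0 else 1.0
    err = gt - scale * pred
    # Assumed normalisation: root of summed squared error over frame count.
    return np.sqrt(np.sum(err ** 2)) / len(gt)

if __name__ == "__main__":
    step = np.eye(4)
    step[0, 3] = 1.0                       # move 1 unit along x per frame
    pred_xyz = chain_poses([step] * 4)     # four transforms -> five positions
    gt_xyz = 2.0 * pred_xyz                # same path at a different scale
    print(snippet_ate(pred_xyz, gt_xyz))
```

A sequence-level score would then be the mean and standard deviation of `snippet_ate` over all 5-frame snippets.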

IV-D Ablation study

To analyze the individual effect of each component in our framework, we first perform ablation studies on KITTI and CityScapes by replacing various components. Then our modules are applied to other methods to evaluate their generalization ability. Finally, we experiment on images with different resolutions.

Train Methods Abs Rel Sq Rel RMSE RMSE log δ<1.25 δ<1.25² δ<1.25³
K (M) Baseline 0.121 0.899 4.934 0.199 0.856 0.955 0.980
K (M) Baseline+IDCE 0.117 0.875 4.829 0.196 0.862 0.956 0.981
K (M) Baseline+HAM 0.112 0.855 4.781 0.190 0.878 0.960 0.981
K (M) Baseline+HAM+IDCE 0.106 0.799 4.662 0.187 0.889 0.961 0.982
K (MS) Baseline 0.114 0.897 4.837 0.193 0.877 0.959 0.981
K (MS) Baseline+IDCE 0.108 0.817 4.677 0.189 0.884 0.960 0.981
K (MS) Baseline+HAM 0.107 0.816 4.663 0.187 0.887 0.961 0.981
K (MS) Baseline+HAM+IDCE 0.102 0.776 4.534 0.183 0.893 0.963 0.982
CS (M) Baseline 0.194 1.340 5.896 0.256 0.697 0.919 0.974
CS (M) Baseline+HAM 0.189 1.354 5.859 0.253 0.706 0.922 0.974
CS (M) Baseline+HAM+IDCE 0.169 1.303 5.706 0.238 0.765 0.933 0.974
TABLE V: Evaluation of each component in our framework on KITTI’s eigen test split. M and S denote monocular video and stereo supervision. K and CS are KITTI and CityScapes datasets.

For the ablation studies, we use the baseline [17] and images of 192×640 resolution. As shown in Table V, the proposed modules provide benefits from different perspectives. Added individually, HAM achieves better performance than IDCE. We hypothesize that noise in the unit stream has a great impact on the self-supervised depth estimation task, and that the global transformation obtained by HAM can effectively reduce this uncertainty. To verify this point, we compute statistics of the HAM-processed features over the number and channel dimensions. As shown in Fig. 5, the visualized plane becomes smoother after processing by HAM. On this basis, we add the IDCE module to transfer implicit depth cues from the unit stream to the DepthNet. The depth cues serve as a dynamic supplement for DepthNet. When the two modules are combined, our proposed method achieves SOTA results.

Fig. 5: Visualization results of PoseNet feature statistics. (a) is feature statistics without the post-processing by HAM. (b) is feature statistics with the post-processing by HAM.

Fig. 6 compares depth maps at object boundaries, including two dynamic scenes and two static scenes. Compared with the baseline, IDCE effectively reduces motion blur by combining the depth information between adjacent frames, and provides a clearer contour for both dynamic and static objects.

Fig. 6: Comparison of the depth map at the object boundary. (a) Predicted depth maps without IDCE. (b) Predicted depth maps with IDCE.
Pose network Attention Module Train Abs Rel Sq Rel RMSE RMSE log δ<1.25 δ<1.25² δ<1.25³
ResPose CNN - MS 0.114 0.897 4.837 0.193 0.877 0.959 0.981
ResPose CNN CAM MS 0.108 0.853 4.782 0.189 0.886 0.960 0.981
ResPose CNN HAM MS 0.107 0.816 4.663 0.187 0.887 0.961 0.981
Pose CNN - M 0.136 1.067 5.274 0.209 0.840 0.950 0.979
Pose CNN CAM M 0.139 1.080 5.326 0.211 0.831 0.949 0.979
Pose CNN HAM M 0.136 1.024 5.151 0.207 0.849 0.952 0.980
TABLE VI: Evaluation of different attention modules on different PoseNets.
Train Resolution Abs Rel Sq Rel RMSE RMSE log δ<1.25 δ<1.25² δ<1.25³
K (M) (128×416) 0.113 0.864 4.872 0.194 0.872 0.956 0.980
K (M) (192×640) 0.106 0.799 4.662 0.187 0.889 0.961 0.982
K (M) (320×1024) 0.106 0.773 4.491 0.185 0.890 0.962 0.982
K (MS) (128×416) 0.106 0.774 4.623 0.184 0.886 0.962 0.983
K (MS) (192×640) 0.102 0.776 4.534 0.183 0.893 0.963 0.982
K (MS) (320×1024) 0.101 0.725 4.360 0.179 0.898 0.965 0.983
CS+K (M) (128×416) 0.113 0.866 4.843 0.195 0.874 0.955 0.980
CS+K (M) (192×640) 0.106 0.774 4.623 0.184 0.886 0.962 0.983
CS+K (M) (320×1024) 0.104 0.771 4.463 0.183 0.893 0.963 0.982
TABLE VII: Ablation studies on different image resolutions.

The generalization ability of HAM is evaluated by adopting different PoseNets. Table VI shows that HAM benefits both ResPose CNN [17] and Pose CNN [45], whereas a basic attention module such as CAM degrades performance on Pose CNN. We conjecture that this is due to the noisy unit-stream space provided by the Pose CNN. Basic attention is not conducive to capturing reasonable geometric transformations from the sophisticated shallow space, while HAM produces more robust results.

Table VII shows the results for different training settings on images with different resolutions. High-resolution images improve performance but increase training time: training the K (M) (128×416) model takes approximately 9 hours, while the (320×1024) model takes about 49 hours.
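To put this trade-off in numbers, the quoted figures can be compared directly. The sketch below is plain arithmetic on values from the text and Table VII; the helper names are hypothetical, and it assumes the 49-hour figure refers to the K (M) setting. Under that assumption, the largest resolution has about 6.15 times as many pixels as the smallest, takes about 5.4 times as long to train, and lowers the K (M) RMSE from 4.872 to 4.491, a relative reduction of about 7.8%.

```python
def scaling_summary(low=(128, 416, 9.0), high=(320, 1024, 49.0)):
    """Compare pixel-count growth with training-time growth.

    Each tuple is (height, width, training hours) as quoted in the text.
    """
    low_pixels = low[0] * low[1]
    high_pixels = high[0] * high[1]
    pixel_ratio = high_pixels / low_pixels   # how many times more pixels
    time_ratio = high[2] / low[2]            # how many times longer to train
    return pixel_ratio, time_ratio

def relative_gain(before, after):
    """Relative reduction of an error metric (positive means improvement)."""
    return (before - after) / before

if __name__ == "__main__":
    pixel_ratio, time_ratio = scaling_summary()
    print(f"pixels x{pixel_ratio:.2f}, training time x{time_ratio:.2f}")
    # K (M) RMSE at 128x416 versus 320x1024, taken from Table VII.
    print(f"RMSE reduction: {relative_gain(4.872, 4.491):.1%}")
```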

V Conclusion

To address the depth discontinuity and motion artifact problems, a novel self-supervised joint learning framework is proposed. Our main idea is to take advantage of the unit stream, which represents the spatial and temporal information in consecutive frames. The proposed framework uses an implicit cue extractor to extract static and dynamic depth cues from the unit stream in shallow space, and uses these implicit cues to guide the depth estimation of a single image. Moreover, a high-dimensional attention module is introduced to extract global pose information, effectively reducing appearance loss. Extensive experimental results demonstrate that our method outperforms SOTA methods on the KITTI and Make3D datasets by a significant margin, and that this framework can be generalized to any self-supervised monocular depth estimation network. For future work, it is worthwhile to explore more accurate visual odometry based on this framework.


  • [1] A. A. Abarghouei and T. P. Breckon (2018) Real-time monocular depth estimation using synthetic data with domain adaptation via image style transfer. In CVPR, pp. 2800–2810. Cited by: §II-B2.
  • [2] J. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M. Cheng, and I. Reid (2019) Unsupervised scale-consistent depth and ego-motion learning from monocular video. In NeurIPS, Cited by: §I, §I, §II-B2, Fig. 2, TABLE I, §IV-B, TABLE III.
  • [3] B. Bozorgtabar, M. S. Rad, D. Mahapatra, and J. Thiran (2019) SynDeMo: synergistic deep feature alignment for joint learning of depth and ego-motion. In ICCV, Cited by: §II-B2, TABLE I, §IV-C1, TABLE III, TABLE IV.
  • [4] V. Casser, S. Pirk, R. Mahjourian, and A. Angelova (2019) Depth prediction without the sensors: leveraging structure for unsupervised learning from monocular videos. In AAAI, pp. 8001–8008. Cited by: §II-B2, TABLE I.
  • [5] B. Chen, W. Deng, and J. Hu (2019) Mixed high-order attention network for person re-identification. In ICCV, Cited by: §III-B3, §III-B3.
  • [6] C. Chen, A. Seff, A. Kornhauser, and J. Xiao (2015) DeepDriving: learning affordance for direct perception in autonomous driving. In ICCV, pp. 2722–2730. Cited by: §I.
  • [7] Y. Chen, C. Schmid, and C. Sminchisescu (2019) Self-supervised learning with geometric constraints in monocular video: connecting flow, depth, and camera. In ICCV, Cited by: §II-B, TABLE I, §IV-C1, TABLE IV.
  • [8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR, Cited by: TABLE I, §IV-A.
  • [9] G. N. Desouza and A. C. Kak (2002) Vision for mobile robot navigation: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2), pp. 237–267. Cited by: §I.
  • [10] D. Eigen and R. Fergus (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 2650–2658. Cited by: §IV-C1.
  • [11] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. In NIPS, pp. 2366–2374. Cited by: §II-A, TABLE I, §IV-A.
  • [12] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao (2018) Deep ordinal regression network for monocular depth estimation. In CVPR, pp. 2002–2011. Cited by: §I, §II-A, TABLE I.
  • [13] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu (2019) Dual attention network for scene segmentation. In CVPR, pp. 3146–3154. Cited by: §III-B3.
  • [14] R. Garg, B. G. V. Kumar, G. Carneiro, and I. D. Reid (2016) Unsupervised CNN for single view depth estimation: geometry to the rescue. In ECCV, Vol. 9912, pp. 740–756. Cited by: §II-B1, TABLE I.
  • [15] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the KITTI dataset. I. J. Robotics Res. 32 (11), pp. 1231–1237. Cited by: §II-A, TABLE I, §IV-A.
  • [16] C. Godard, O. M. Aodha, and G. J. Brostow (2017) Unsupervised monocular depth estimation with left-right consistency. In CVPR, pp. 6602–6611. Cited by: §II-B1, Fig. 2, §III-A, TABLE I.
  • [17] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow (2019) Digging into self-supervised monocular depth prediction. In ICCV, Cited by: §I, §II-B2, Fig. 2, §III-A, §III-B4, TABLE I, TABLE II, §IV-B, §IV-C1, §IV-D, §IV-D, TABLE III, TABLE IV.
  • [18] X. Guo, H. Li, S. Yi, J. S. J. Ren, and X. Wang (2018) Learning monocular depth by distilling cross-domain stereo networks. In ECCV, Vol. 11215, pp. 506–523. Cited by: §II-B, TABLE I.
  • [19] R. Hartley and A. Zisserman (2004) Multiple view geometry in computer vision. Cambridge University Press. Cited by: §II-B1.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §III-B2, §III-B4.
  • [21] J. Hu, M. Ozay, Y. Zhang, and T. Okatani (2019) Revisiting single image depth estimation: toward higher resolution maps with accurate object boundaries. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1043–1051. Cited by: §I.
  • [22] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In CVPR, pp. 7132–7141. Cited by: §III-B3.
  • [23] D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In ICLR, Y. Bengio and Y. LeCun (Eds.), Cited by: §IV-B.
  • [24] Y. Kuznietsov, J. Stückler, and B. Leibe (2017) Semi-supervised deep learning for monocular depth map prediction. In CVPR, pp. 2215–2223. Cited by: TABLE III.
  • [25] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab (2016) Deeper depth prediction with fully convolutional residual networks. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 239–248. Cited by: §II-A.
  • [26] R. Li, S. Wang, Z. Long, and D. Gu (2018) UnDeepVO: monocular visual odometry through unsupervised deep learning. In ICRA, pp. 7286–7291. Cited by: §II-B.
  • [27] F. Liu, C. Shen, G. Lin, and I. Reid (2016) Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (10), pp. 2024–2039. Cited by: §I, §II-A, TABLE I.
  • [28] C. Luo, Z. Yang, P. Wang, Y. Wang, W. Xu, R. Nevatia, and A. L. Yuille (2019) Every pixel counts ++: joint learning of geometry and motion with 3d holistic understanding. TPAMI. Cited by: TABLE II.
  • [29] R. Mahjourian, M. Wicke, and A. Angelova (2018) Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. In CVPR, pp. 5667–5675. Cited by: §I, §I, §II-B, TABLE I, TABLE II.
  • [30] A. Paszke, S. Gross, S. Chintala, G. Chanan, and E. Yang (2017) Automatic differentiation in pytorch. Cited by: §IV-B.
  • [31] V. Patil, W. V. Gansbeke, D. Dai, and L. V. Gool (2020) Don’t forget the past: recurrent depth estimation from monocular video. CoRR abs/2001.02613. Cited by: §II-B2, TABLE I, §IV-C1.
  • [32] A. Ranjan, V. Jampani, L. Balles, K. Kim, D. Sun, J. Wulff, and M. J. Black (2019) Competitive collaboration: joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In CVPR, pp. 12232–12241. Cited by: §I, §II-B2, TABLE I, §IV-C1.
  • [33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §IV-B.
  • [34] A. Saxena, M. Sun, and A. Y. Ng (2009) Make3D: learning 3d scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (5), pp. 824–840. Cited by: §IV-A, TABLE III.
  • [35] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus (2012) Indoor segmentation and support inference from RGBD images. In ECCV, Vol. 7576, pp. 746–760. Cited by: §II-A.
  • [36] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger (2017) Sparsity invariant cnns. In 2017 International Conference on 3D Vision, 3DV 2017, Qingdao, China, October 10-12, 2017, pp. 11–20. Cited by: TABLE II, §IV-C1.
  • [37] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey (2018) Learning depth from monocular videos using direct methods. In CVPR, pp. 2022–2030. Cited by: Fig. 2, TABLE I.
  • [38] J. Watson, M. Firman, G. J. Brostow, and D. Turmukhambetov (2019) Self-supervised monocular depth hints. In ICCV, Cited by: TABLE I, §IV-C1.
  • [39] Z. Yang, P. Wang, Y. Wang, W. Xu, and R. Nevatia (2018) LEGO: learning edge with geometry all at once by watching videos. In CVPR, pp. 225–234. Cited by: §II-B2.
  • [40] Z. Yang, P. Wang, Y. Wang, W. Xu, and R. Nevatia (2018) Every pixel counts: unsupervised geometry learning with holistic 3d motion understanding. In Computer Vision - ECCV 2018 Workshops, Vol. 11133, pp. 691–709. Cited by: §I, TABLE I.
  • [41] Z. Yang, P. Wang, W. Xu, L. Zhao, and R. Nevatia (2018) Unsupervised learning of geometry from videos with edge-aware depth-normal consistency. In AAAI, pp. 7493–7500. Cited by: TABLE I.
  • [42] Z. Yin and J. Shi (2018) GeoNet: unsupervised learning of dense depth, optical flow and camera pose. In CVPR, pp. 1983–1992. Cited by: §I, §I, §II-B2, TABLE I, TABLE II, §IV-C1.
  • [43] H. Zhang, C. Shen, Y. Li, Y. Cao, Y. Liu, and Y. Yan (2019) Exploiting temporal consistency for real-time video depth estimation. Cited by: §I.
  • [44] J. Zhou, Y. Wang, K. Qin, and W. Zeng (2019) Unsupervised high-resolution depth learning from videos with dual networks. In ICCV, Cited by: §III-B3, TABLE I, TABLE III, TABLE IV.
  • [45] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised learning of depth and ego-motion from video. In CVPR, pp. 6612–6619. Cited by: §I, §I, §II-B2, §II-B, Fig. 2, §III-A, TABLE I, TABLE II, §IV-A, §IV-A, §IV-C2, §IV-D, TABLE III, TABLE IV.
  • [46] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. Cited by: §III-A.
  • [47] Y. Zou, Z. Luo, and J. Huang (2018) DF-net: unsupervised joint learning of depth and flow using cross-task consistency. In ECCV, Vol. 11209, pp. 38–55. Cited by: §I, §II-B2, TABLE I, §IV-C1.