In self-supervised monocular depth estimation, depth discontinuities and artifacts around moving objects remain challenging problems. Existing self-supervised methods usually utilize a single view to train the depth estimation network. Compared with static views, the abundant dynamic properties between video frames are beneficial to refined depth estimation, especially for dynamic objects. In this work, we propose a novel self-supervised joint learning framework for depth estimation using consecutive frames from monocular and stereo videos. The main idea is to use an implicit depth cue extractor that leverages dynamic and static cues to generate useful depth proposals. These cues can predict distinguishable motion contours and geometric scene structures. Furthermore, a new high-dimensional attention module is introduced to extract a clear global transformation, which effectively suppresses the uncertainty of local descriptors in high-dimensional space and leads to more reliable optimization of the learning framework. Experiments demonstrate that the proposed framework outperforms the state-of-the-art (SOTA) on the KITTI and Make3D datasets.
Depth and ego-motion estimation play essential roles in understanding geometric scenes from videos and images, and have broad applications such as robotics and autonomous driving. Supervised models [21, 12, 27, 43] have obtained depth maps with vibrant details from color images. However, it is difficult and expensive to accurately collect large-scale labels in practice, and these supervised models are only suitable for specific scenarios.
In recent years, self-supervised methods have attracted increasing interest, with some notable successes [45, 29, 42, 40, 17, 2]. In the absence of ground truth, one can still recover scene depth and camera ego-motion from monocular video sequences using self-supervised methods. The key idea is to first warp the source view to the target view through the estimated scene depth and camera ego-motion, and then simultaneously optimize the depth estimation network (DepthNet) and the pose estimation network (PoseNet) by minimizing the view reconstruction loss.
However, this framework has the following deficiencies. (1) The DepthNet only uses the static information of the current view and does not effectively utilize the rich dynamic and static depth cues between adjacent views. Hence, the predicted depths of adjacent frames are inconsistent in scale, and the depth exhibits abrupt changes between frames. (2) The existing masking methods [42, 32, 47] filter out pixels where there is object motion in the scene, so failures in motion regions are not penalized enough. More precisely, the network reaches a deadlock and cannot find the globally optimal solution, which results in artifacts around moving objects.
The main goal of this work is to effectively alleviate the problems of depth discontinuity and moving-object artifacts mentioned above. We take advantage of the temporal and spatial information changes between consecutive frames, which we call the unit stream in this work. Based on the unit stream and the mainstream framework adopted in [45, 2, 29], which consists of the DepthNet and PoseNet, we propose a novel self-supervised joint learning framework as shown in Fig. 1. This framework introduces two efficient modules to utilize the unit stream. (1) The first module, the implicit depth cue extractor (IDCE), connects the DepthNet and PoseNet. IDCE automatically selects reliable cues to constrain static and dynamic geometric scenes. The unit stream is modeled via statistics of convolutional activations to extract implicit dynamic/static cues and produce powerful depth proposals. The proposals guide subsequent scene depth estimation and make depth estimation near dynamic objects more accurate, while static cues enforce the DepthNet to predict smooth depth changes over consecutive snippets. (2) The second module, the high-dimensional attention module (HAM), obtains a more robust camera pose for accurate view reconstruction. It extracts the global dynamic transformation from the unit stream by using convolutions and a Gaussian kernel. This module effectively suppresses the uncertainty of local descriptors in the high-dimensional space and coordinates the depth network to learn better weights. Note that the proposed framework and modules can be generalized to other existing self-supervised depth estimation methods.
To summarize, our main contributions are three-fold:
To the best of our knowledge, this is the first work that proposes a module (IDCE) connecting the DepthNet and PoseNet in order to extract implicit static and dynamic depth cues from the shallow space of the unit stream.
The novel HAM captures global pose transformation from the unit stream, making the joint learning framework optimization more efficient. Besides, it can be used as a post-processing method for other pose estimation networks.
Depth estimation has been studied for a long time. In this section, we mainly discuss the related works based on deep learning from two perspectives: supervised and self-supervised depth estimation.
Eigen et al. first employed a multi-scale convolutional neural network, which refined the estimated depth map from low to high spatial resolution. To overcome the low-resolution problem, Laina et al. employed an up-sampling method for learning. Fu et al. introduced a spacing-increasing discretization strategy to discretize depth, and then adopted multi-scale dilated convolutions to capture multi-scale information in parallel.
Without requiring ground truth labels, self-supervised methods use photometric constraints from multiple views, e.g., views captured by a monocular camera or a stereo rig [7, 18, 26, 29, 45]. The following discussion mainly focuses on these two aspects.
Garg et al.  leveraged the epipolar geometry inherent in stereo views to train the monocular DepthNet, where the photometric consistency loss between stereo pairs is used as the supervision signal. Godard et al.  proposed a left-right consistency constraint between the left and right disparity maps. In these methods, accurately rectified stereo cameras provide explicit pose supervision for self-supervised depth estimation.
SfMLearner  was the first method to learn both depth and ego-motion using the geometric constraints of monocular video. Meanwhile, additional masks ignored moving objects that violated the rigid scene assumption. Following this framework, several approaches [42, 32, 17, 2, 47] have been proposed to address the challenge of moving objects. Although they show significant improvements in performance, they still struggle with dynamic scenes in a monocular setting. These methods pay little attention to moving areas or discard them directly, and thus the network reaches a deadlock and cannot reach the global minimum in areas with motion. As a result, the DepthNet cannot predict distinguishable motion contours and geometric scene structure. Casser et al.  proposed a novel approach that models moving objects and produces higher-quality results. Besides, it was proposed in [1, 3] that synthetic data can be used to collect diverse training data. Yang et al.  combined normal and edge geometry to achieve better performance. Very recently, Patil et al.  exploited a recurrent neural network (RNN) to generate a time series of depth maps. Although they use spatio-temporal information, the complex network structure creates huge computational costs during training.
Unlike the works mentioned above, we build on the general framework and propose a new self-supervised joint learning framework in which the DepthNet and PoseNet are connected to extract implicit static and dynamic depth cues from the shallow space of the unit stream. Moreover, the global dynamic transformation in the unit stream is also exploited.
In this section, we introduce the proposed joint learning framework, which takes adjacent video frames as input and outputs a depth map. Details of the two proposed modules, IDCE and HAM, will be described. Before that, we first review the key ideas of the commonly used baseline in self-supervised depth estimation.
The baseline consists of two networks, i.e., the DepthNet and the PoseNet. The former estimates the dense depth map of the target view, and the latter estimates the relative camera pose between nearby views for monocular and mixed (i.e., monocular and stereo) training. In the absence of ground truth, the DepthNet and PoseNet can be optimized solely using the view reconstruction loss between the original target view and the synthesized target view.
According to , a pixel $p_t$ in the target view $I_t$ can be projected onto the source view $I_s$ as:

$$p_s \sim K\, T_{t \rightarrow s}\, D_t(p_t)\, K^{-1} p_t,$$

where $D_t$ is the predicted depth of the target view $I_t$, $T_{t \rightarrow s}$ is the relative camera pose of the source view $I_s$ with respect to the target view $I_t$, $p_t$ and $p_s$ are the homogeneous coordinates of a pixel in $I_t$ and $I_s$, respectively, and $K$ is the camera intrinsic matrix. The synthesized target view is then obtained by bilinearly sampling $I_s$ at the projected coordinates $p_s$. When training the self-supervised model with stereo views, $D_t$ is the only unknown variable. However, for monocular training, the source view comes from the temporally adjacent frames $\{I_{t-1}, I_{t+1}\}$, and thus the relative camera pose $T_{t \rightarrow s}$ also needs to be predicted. For mixed training, the source view comes from the temporally adjacent frames and the opposite stereo view.
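For concreteness, the warping step above can be implemented as follows. This is a minimal PyTorch sketch (back-projection with $D_t$, rigid transformation by $T_{t \rightarrow s}$, projection with $K$, and bilinear sampling); the function name, argument layout, and 4×4 intrinsics padding are illustrative assumptions rather than the exact implementation used here.

```python
import torch
import torch.nn.functional as F

def warp_source_to_target(src_img, depth_t, T_t2s, K, K_inv):
    """Synthesize the target view by sampling the source view I_s.

    src_img : (B, 3, H, W) source frame I_s
    depth_t : (B, 1, H, W) predicted target depth D_t
    T_t2s   : (B, 4, 4) relative pose of the source w.r.t. the target
    K, K_inv: (B, 4, 4) camera intrinsics and their inverse (padded to 4x4)
    """
    b, _, h, w = depth_t.shape
    # Homogeneous pixel grid p_t of the target view.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0)
    pix = pix.float().view(1, 3, -1).to(depth_t.device)

    # Back-project: camera coordinates of the target view, D_t(p_t) * K^-1 * p_t.
    cam = K_inv[:, :3, :3] @ pix * depth_t.view(b, 1, -1)
    cam = torch.cat([cam, torch.ones(b, 1, h * w, device=cam.device)], dim=1)

    # Rigid transform into the source frame and project: p_s ~ K * T_{t->s} * X_t.
    proj = K[:, :3, :] @ (T_t2s @ cam)
    px = proj[:, :2] / (proj[:, 2:3] + 1e-7)

    # Normalize coordinates to [-1, 1] and bilinearly sample the source image.
    x = 2.0 * px[:, 0] / (w - 1) - 1.0
    y = 2.0 * px[:, 1] / (h - 1) - 1.0
    grid = torch.stack([x, y], dim=2).view(b, h, w, 2)
    return F.grid_sample(src_img, grid, padding_mode="border", align_corners=True)
```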
Concerning the loss function, following , a common total loss is composed of a photometric loss and a smoothness loss:

$$\mathcal{L}^{s} = \mu\, \mathcal{L}_p^{s} + \lambda\, \mathcal{L}_{smooth}^{s},$$

where $\mu$ is the auto-masking loss, $s$ is the scale index, and $\lambda$ is a hyperparameter, which is set to 0.001. The average loss over multiple scales is taken as the final loss.
The photometric loss takes the per-pixel minimum reprojection error over all source views:

$$\mathcal{L}_p = \frac{1}{|V|}\sum_{i \in V} \min_{s}\, pe\!\left(I_t(i),\, I_{s \rightarrow t}(i)\right),$$

where $i$ is the pixel coordinate index, $\alpha$ is a hyperparameter that is set to 0.85, and $pe$ denotes:

$$pe(I_a, I_b) = \frac{\alpha}{2}\bigl(1 - \mathrm{SSIM}(I_a, I_b)\bigr) + (1 - \alpha)\,\lVert I_a - I_b \rVert_1.$$

Here, the per-pixel minimum reprojection loss is adopted to calculate the minimum photometric error at each scale over all source views.
The smoothness loss is an edge-aware regularization on the mean-normalized inverse depth:

$$\mathcal{L}_{smooth} = \left|\partial_x d^{*}_t\right| e^{-\left|\partial_x I_t\right|} + \left|\partial_y d^{*}_t\right| e^{-\left|\partial_y I_t\right|},$$

where $d^{*}_t = d_t / \bar{d}_t$ is the mean-normalized inverse depth map, and $\partial_x$ and $\partial_y$ denote gradients in the $x$ and $y$ directions, respectively. Applying this regularization enforces the DepthNet to produce a sharp edge distribution at sharply varying pixels while producing smooth depth in continuous regions.
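The loss terms above can be sketched in PyTorch as follows. The 3×3-window SSIM approximation and the helper names are assumptions made for illustration, following the common Monodepth2-style formulation rather than this paper's exact code.

```python
import torch
import torch.nn.functional as F

def dssim(x, y):
    """(1 - SSIM)/2 over 3x3 average-pooled windows (simplified SSIM)."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sx = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sy = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sxy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sxy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sx + sy + c2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def photometric_error(pred, target, alpha=0.85):
    # pe = alpha/2 * (1 - SSIM) + (1 - alpha) * |I_a - I_b|, averaged over channels.
    l1 = torch.abs(pred - target).mean(1, keepdim=True)
    return alpha * dssim(pred, target).mean(1, keepdim=True) + (1 - alpha) * l1

def reprojection_loss(warped_srcs, target):
    # Per-pixel minimum over all warped source views (handles occlusion/disocclusion).
    errs = torch.cat([photometric_error(w, target) for w in warped_srcs], dim=1)
    return errs.min(dim=1, keepdim=True)[0].mean()

def smoothness_loss(disp, img):
    # Edge-aware smoothness on the mean-normalized inverse depth (disparity).
    d = disp / (disp.mean(dim=[2, 3], keepdim=True) + 1e-7)
    dx = torch.abs(d[:, :, :, :-1] - d[:, :, :, 1:])
    dy = torch.abs(d[:, :, :-1, :] - d[:, :, 1:, :])
    ix = torch.abs(img[:, :, :, :-1] - img[:, :, :, 1:]).mean(1, keepdim=True)
    iy = torch.abs(img[:, :, :-1, :] - img[:, :, 1:, :]).mean(1, keepdim=True)
    return (dx * torch.exp(-ix)).mean() + (dy * torch.exp(-iy)).mean()
```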
| Methods | Dataset | Abs Rel | Sq Rel | RMSE | RMSE log | δ<1.25 | δ<1.25² | δ<1.25³ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Eigen et al. | K (D) | 0.203 | 1.548 | 6.307 | 0.282 | 0.702 | 0.890 | 0.958 |
| Liu et al. | K (D) | 0.202 | 1.614 | 6.523 | 0.275 | 0.678 | 0.895 | 0.965 |
| Garg et al. | K (S) | 0.152 | 1.226 | 5.849 | 0.246 | 0.784 | 0.921 | 0.967 |
| Godard et al. | CS+K (S) | 0.124 | 1.076 | 5.311 | 0.219 | 0.847 | 0.942 | 0.973 |
| Yang et al. | K+CS (S) | 0.114 | 1.074 | 5.836 | 0.208 | 0.856 | 0.939 | 0.976 |
| Guo et al. | K (DS) | 0.096 | 0.641 | 4.095 | 0.168 | 0.892 | 0.967 | 0.986 |
| DORN | K (D) | 0.072 | 0.307 | 2.727 | 0.120 | 0.932 | 0.984 | 0.994 |
| Zhou et al. | K (M) | 0.208 | 1.768 | 6.856 | 0.283 | 0.678 | 0.885 | 0.957 |
| Yang et al. | K (M) | 0.182 | 1.481 | 6.501 | 0.267 | 0.725 | 0.906 | 0.963 |
| Mahjourian et al. | K (M) | 0.163 | 1.240 | 6.220 | 0.250 | 0.762 | 0.916 | 0.968 |
| Wang et al. | K (M) | 0.151 | 1.257 | 5.583 | 0.228 | 0.810 | 0.936 | 0.974 |
| GeoNet | K (M) | 0.149 | 1.060 | 5.567 | 0.226 | 0.796 | 0.935 | 0.975 |
| DF-Net | K (M) | 0.150 | 1.124 | 5.507 | 0.223 | 0.806 | 0.933 | 0.973 |
| Ranjan et al. | K (M) | 0.140 | 1.070 | 5.326 | 0.217 | 0.826 | 0.941 | 0.975 |
| Struct2depth | K (M) | 0.141 | 1.026 | 5.291 | 0.215 | 0.816 | 0.945 | 0.979 |
| SynDeMo | K+vK (MD) | 0.116 | 0.746 | 4.627 | 0.194 | 0.858 | 0.952 | 0.977 |
| GLNet | K (M) | 0.135 | 1.070 | 5.230 | 0.210 | 0.841 | 0.948 | 0.980 |
| Zhou et al. (384×1248) | K (M) | 0.121 | 0.837 | 4.945 | 0.197 | 0.853 | 0.955 | 0.982 |
| Monodepth2 | K (M) | 0.115 | 0.903 | 4.863 | 0.193 | 0.877 | 0.959 | 0.981 |
| Bian et al. | K (M) | 0.137 | 1.089 | 5.439 | 0.217 | 0.830 | 0.942 | 0.975 |
| Patil et al. | K (M) | 0.111 | 0.821 | 4.650 | 0.187 | 0.883 | 0.961 | 0.982 |
| Ours (192×640) | K (M) | 0.106 | 0.799 | 4.662 | 0.187 | 0.889 | 0.961 | 0.982 |
| Ours (320×1024) | K (M) | 0.106 | 0.773 | 4.491 | 0.185 | 0.890 | 0.962 | 0.982 |
| Monodepth2 (192×640) | K (MS) | 0.106 | 0.818 | 4.750 | 0.196 | 0.874 | 0.957 | 0.979 |
| Watson et al. (192×640) | K (MSD) | 0.106 | 0.780 | 4.695 | 0.193 | 0.875 | 0.958 | 0.980 |
| Ours (192×640) | K (MS) | 0.102 | 0.776 | 4.534 | 0.183 | 0.893 | 0.963 | 0.982 |
| Monodepth2 (320×1024) | K (MS) | 0.106 | 0.806 | 4.630 | 0.193 | 0.876 | 0.958 | 0.980 |
| Watson et al. (320×1024) | K (MSD) | 0.100 | 0.728 | 4.469 | 0.185 | 0.885 | 0.962 | 0.982 |
| GLNet | CS+K (M) | 0.129 | 1.044 | 5.361 | 0.212 | 0.843 | 0.938 | 0.976 |
| Bian et al. | CS+K (M) | 0.128 | 1.047 | 5.234 | 0.208 | 0.846 | 0.947 | 0.976 |
| Struct2depth | CS+K (M) | 0.108 | 0.825 | 4.750 | 0.186 | 0.873 | 0.957 | 0.982 |
| SynDeMo | CS+K+vK (MD) | 0.112 | 0.740 | 4.619 | 0.187 | 0.863 | 0.958 | 0.983 |
The mainstream framework only considers the camera pose information in the unit stream and ignores the role of depth cues between adjacent frames. Motivated by this, we consider both implicit depth cues and pose information to be important attributes of the unit stream that can act on the appropriate network. The specific scenarios of the unit stream are as follows: (1) static scenes, (2) moving objects, (3) a moving camera relative to static scenes, (4) a stationary camera relative to moving objects, and (5) a moving camera relative to moving objects. On the one hand, all scenarios provide depth cues as a dynamic supplement to the depth information of a single frame. On the other hand, only scenarios (1) and (3) provide camera pose information, which is an essential link in view reconstruction, while (2), (4) and (5) are inappropriate since the moving objects in them violate the underlying static-scene assumption of view reconstruction. Due to the lack of a proper supervision signal, the depth estimation network reaches a deadlock in the areas of dynamic objects. In addition, a single view cannot provide the dynamic properties of moving objects. The PoseNet essentially learns motion information between frames. In order to make full use of inter-frame motion information, we first use the PoseNet encoder as the unit stream extraction network, whose outputs form the unit stream that models the shallow space between frames, and then extract dynamic and static depth cues from the complex and diverse implicit information in this space.
To better extract the implicit cues and estimate depth in all five scenarios mentioned above, we propose two modules, IDCE and HAM, on top of the mainstream framework. The depth cues extracted by IDCE can be used as a dynamic supplement. The cues are modeled via statistics of convolutional activations and combined with the feature of the target frame by an element-wise sum, thereby increasing the proportion of moving-object features. They guide subsequent scene depth estimation and make depth estimation near dynamic objects more accurate, while static proposals enforce the DepthNet to predict smooth depth changes over consecutive snippets. HAM effectively reduces the noise caused by moving objects in scenarios (2), (4) and (5). The detailed architecture of our proposed framework is illustrated in Fig. 1.
As shown in Fig. 1 (b), the IDCE is an intermediate transition layer that links the unit stream encoding network and the DepthNet. It is designed to transfer implicit depth cues from the unit stream to the DepthNet. We adapt the bottleneck block  and use it as the basic block. Empirically, we cascade four identical bottlenecks as the final depth cue extractor. As shown in Fig. 1 (b), each bottleneck contains three layers, which are 1×1, 3×3, and 1×1 convolutions. The 1×1 layers are responsible for reducing the channel number to 1/4 and then restoring the dimensions. All layers use a stride of 1. Since the input and output have the same dimensions, identity shortcuts are used directly.
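A minimal PyTorch sketch of the IDCE block described above is given below. The channel width (512, matching the output of a ResNet-18 encoder) and the use of BatchNorm and ReLU inside the block are assumptions for illustration.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """One IDCE block: 1x1 -> 3x3 -> 1x1 convolutions with an identity shortcut.
    The 1x1 layers reduce the channel count to 1/4 and then restore it; all strides are 1."""
    def __init__(self, channels):
        super().__init__()
        mid = channels // 4
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False), nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # identity shortcut (same input/output dimensions)

def make_idce(channels=512):
    """IDCE: four identical cascaded bottlenecks operating on the unit-stream features."""
    return nn.Sequential(*[Bottleneck(channels) for _ in range(4)])
```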
Attention can bias the allocation of available resources towards the most informative parts of an input signal. Recently, the combination of spatial attention and the channel attention module (CAM) has been successfully applied to a variety of vision tasks [22, 5, 13, 44]. Nonetheless, CAM cannot effectively reduce noise and is not rich enough to capture the high-dimensional geometric characteristics of multiple views.
Inspired by , we extend features to high dimensions using the Gaussian kernel and propose the HAM illustrated in Fig. 1 (c). Given a local feature $\mathbf{F} \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$ and $W$ denote the channel, height and width dimensions, we first feed it into three independent convolution layers to generate three new features $\mathbf{K}$, $\mathbf{Q}$ and $\mathbf{V}$, respectively. After that, we apply a Gaussian kernel function between $\mathbf{K}$ and $\mathbf{Q}$ to measure the similarity $\mathbf{S}$ between each feature point of $\mathbf{K}$ and $\mathbf{Q}$. Uncertainties in the unit stream introduced by moving objects, occlusions, and non-Lambertian surfaces are controlled by the similarity $\mathbf{S}$, which also searches for the global transformation across multiple views. We then perform a matrix multiplication between $\mathbf{S}$ and $\mathbf{V}$, multiply the result by a scale parameter $\gamma$, and perform an element-wise sum with the feature $\mathbf{F}$ to obtain the global camera pose feature $\mathbf{E}$ as follows:

$$\mathbf{K} = W_K(\mathbf{F}), \quad \mathbf{Q} = W_Q(\mathbf{F}), \quad \mathbf{V} = W_V(\mathbf{F}),$$

$$\mathbf{S}_{ij} = \exp\!\left(-\frac{\lVert \mathbf{K}_i - \mathbf{Q}_j \rVert^2}{2\sigma^2}\right), \qquad \mathbf{E} = \gamma\,(\mathbf{V}\mathbf{S}) + \mathbf{F},$$
where $W_K(\cdot)$, $W_Q(\cdot)$ and $W_V(\cdot)$ denote convolution layers with an activation function, $\sigma$ is a hyperparameter, which is experimentally set to 0.5, and $\gamma$ is initialized to 0 and gradually updated as the model learns.
The Gaussian kernel expresses the approximate relationship between two tensors. Each of its elements can be expanded into an $n$th-order polynomial by Taylor's expansion:

$$\exp\!\left(-\frac{\lVert \mathbf{K}_i - \mathbf{Q}_j \rVert^2}{2\sigma^2}\right) = \exp\!\left(-\frac{\lVert \mathbf{K}_i \rVert^2 + \lVert \mathbf{Q}_j \rVert^2}{2\sigma^2}\right) \sum_{n=0}^{\infty} \frac{\left(\mathbf{K}_i^{\top}\mathbf{Q}_j\right)^n}{\sigma^{2n}\, n!}.$$

The expansion terms are capable of modeling and exploiting the high-order statistics of the local descriptors $\mathbf{K}_i$ and $\mathbf{Q}_j$. Thus, the Gaussian kernel directly yields a high-order attention 'map' whose values lie in the interval $[0,1]$. Each order of the expansion represents the component of the local descriptor in an $n$-dimensional feature space. Compared with methods that directly use $\mathbf{K}$ and $\mathbf{Q}$ to calculate the attention 'map', this kernel comprehensively considers multi-dimensional similarity: two tensors are globally similar only when they have similar components in each feature space. At the same time, this effectively suppresses the uncertainty of the local descriptor in the high-dimensional space.
HAM can extract global dynamic transformation from the unit stream. The inter-frame features of the original space are mapped to the high-dimensional feature space, which captures more complex and high-order relationships, and matches the global spatial correlation of the original view.
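The following PyTorch sketch illustrates the HAM computation described above (Gaussian-kernel similarity between K and Q, aggregation of V, and a learnable residual sum with the input). The channel reduction factor for K and Q and other implementation details are assumptions, not the exact configuration.

```python
import torch
import torch.nn as nn

class HAM(nn.Module):
    """Sketch of the high-dimensional attention module."""
    def __init__(self, channels, sigma=0.5):
        super().__init__()
        self.k = nn.Conv2d(channels, channels // 8, 1)
        self.q = nn.Conv2d(channels, channels // 8, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.sigma = sigma
        self.gamma = nn.Parameter(torch.zeros(1))  # initialized to 0, learned during training

    def forward(self, f):
        b, c, h, w = f.shape
        k = self.k(f).view(b, -1, h * w)                 # (B, C', N)
        q = self.q(f).view(b, -1, h * w)                 # (B, C', N)
        v = self.v(f).view(b, c, h * w)                  # (B, C,  N)
        # Pairwise squared distances ||K_i - Q_j||^2 between all spatial positions.
        d2 = (k.pow(2).sum(1).unsqueeze(2)               # (B, N, 1)
              + q.pow(2).sum(1).unsqueeze(1)             # (B, 1, N)
              - 2 * torch.bmm(k.transpose(1, 2), q))     # (B, N, N)
        s = torch.exp(-d2.clamp(min=0) / (2 * self.sigma ** 2))  # Gaussian-kernel attention in [0, 1]
        out = torch.bmm(v, s)                            # aggregate V with the similarity map
        return self.gamma * out.view(b, c, h, w) + f     # element-wise sum with the input feature
```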
By integrating the two modules described above into the mainstream framework, we establish a new self-supervised depth estimation framework (see Fig. 1). We rely on the successful architecture in  as our basic framework. Both the DepthNet encoder (DE) and the PoseNet encoder (PE) use the same architecture (ResNet-18) except for the first layer. The number of input channels of the first convolution of PE is changed from 3 to 6, which allows a pair of adjacent frames to be fed into the network. PE is considered as the unit stream extraction network (USEN), and IDCE is used to connect the USEN and the DepthNet decoder (DD). The input size of IDCE is the same as the output of the USEN, and its output size is consistent with the input of the DD. We perform an element-wise sum between the last layers of IDCE and DE, then feed the result into DD. The DepthNet adopts a multi-scale architecture and predicts disparity maps at 1, 1/2, 1/4 and 1/8 of the resolution of the color image. HAM is used as the subsequent processing of the PoseNet decoder to obtain the final global 6D ego-motion. For the proposed modules, we adopt batch normalization right after each convolution and before the ReLU activation.
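The wiring described above can be summarized by the following sketch. The torchvision calls and the assumption that the encoders return feature lists/maps are illustrative; `depth_encoder`, `usen_features`, `idce`, and `depth_decoder` are hypothetical callables standing in for the corresponding networks.

```python
import torch
import torch.nn as nn
from torchvision import models

def make_usen():
    """Unit stream extraction network (USEN): a ResNet-18 whose first convolution
    takes 6 channels (two RGB frames concatenated); the pretrained weights of the
    first layer are discarded, as described in the implementation details."""
    net = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    net.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
    return net

def forward_depth(target, prev, depth_encoder, usen_features, idce, depth_decoder):
    """Illustrative wiring only: `depth_encoder` is assumed to return a list of
    multi-scale features and `usen_features` the deepest USEN feature map."""
    de_feats = list(depth_encoder(target))                          # multi-scale DE features
    unit_stream = usen_features(torch.cat([prev, target], dim=1))   # e.g. (B, 512, H/32, W/32)
    de_feats[-1] = de_feats[-1] + idce(unit_stream)                 # element-wise sum at the last layer
    return depth_decoder(de_feats)                                  # multi-scale disparity maps
```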
| Methods | Train | Abs Rel | Sq Rel | RMSE | RMSE log | δ<1.25 | δ<1.25² | δ<1.25³ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zhou et al. | M | 0.176 | 1.532 | 6.129 | 0.244 | 0.758 | 0.921 | 0.971 |
| Mahjourian et al. | M | 0.134 | 0.983 | 5.501 | 0.203 | 0.827 | 0.944 | 0.981 |
| Monodepth2 (192×640) | M | 0.090 | 0.545 | 3.942 | 0.137 | 0.914 | 0.983 | 0.995 |
| Monodepth2 (192×640) | MS | 0.080 | 0.466 | 3.681 | 0.127 | 0.926 | 0.985 | 0.995 |
Our experiments are mainly conducted on the KITTI, CityScapes and Make3D datasets. The KITTI dataset includes a full suite of raw data such as stereo videos and 3D point clouds. We use 39,810 monocular frames and stereo pairs for training, about 4K images for evaluation, and 697 images from the test split . The CityScapes dataset contains various stereo video sequences recorded in 50 different cities; we choose the monocular sequences of 8-bit images captured by the left camera. We additionally evaluate our KITTI-trained model on the Make3D dataset, which is unseen during training, to assess generalization ability. We also pre-train the network on CityScapes and fine-tune it on KITTI.
As for the evaluation metrics, following Zhou et al. , we use the following metrics to evaluate our depth estimation method on the KITTI test split and the Make3D dataset: (1) Abs Rel, Sq Rel, RMSE and RMSE log (lower is better), and (2) the accuracies δ < 1.25, δ < 1.25² and δ < 1.25³ (higher is better).
Median scaling  is used to align the predictions with the ground truth during evaluation. Note that we remove sequences in which the camera does not move between frames during training. During evaluation, two adjacent frames are fed into the USEN and DE. For discrete samples, such as the first frame of a video, we duplicate each sample to simulate adjacent frames.
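For reference, the standard evaluation with median scaling can be written as follows; this follows the commonly used protocol and is not specific to this paper's code.

```python
import numpy as np

def evaluate_depth(pred, gt, min_depth=1e-3, max_depth=80.0):
    """Per-image depth metrics with median scaling, as in the monocular protocol."""
    mask = (gt > min_depth) & (gt < max_depth)
    pred, gt = pred[mask], gt[mask]
    pred *= np.median(gt) / np.median(pred)        # median scaling aligns prediction with GT
    pred = np.clip(pred, min_depth, max_depth)

    thresh = np.maximum(gt / pred, pred / gt)
    a1, a2, a3 = [(thresh < 1.25 ** k).mean() for k in (1, 2, 3)]
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean((gt - pred) ** 2 / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    return abs_rel, sq_rel, rmse, rmse_log, a1, a2, a3
```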
Our model is implemented with the PyTorch framework on a single Tesla V100 GPU and trained for 20 epochs with a batch size of 8. Additionally, random contrast, brightness, saturation, color jittering, horizontal flips, and random resizing are used during training. The default input and output resolution is 192×640. For comparison, we also use a larger resolution of 320×1024 in the experiments.
The DE and USEN are initialized with a ResNet-18 backbone pretrained on the ImageNet dataset; the USEN uses the pre-trained weights but discards those of the first layer. We adopt the Adam  optimizer, and the initial learning rate is reduced to 10% of its value after 15 epochs. $\beta_1$, $\beta_2$ and the weight decay are set to 0.9, 0.999 and 0.0001, respectively. To alleviate the difficulty of directly optimizing the IDCE and HAM, an effective training strategy is explored to decouple the disparity images from the transformation: we first train the baseline and HAM, and then jointly train the entire model. This strategy leads to superior performance on multiple datasets.
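A sketch of the optimizer setup consistent with the values above is shown below. The initial learning rate of 1e-4 is an assumption (the exact value is not given here), while the betas, weight decay, and the decay to 10% after 15 epochs follow the text.

```python
import torch

def build_optimizer(params, lr=1e-4):
    # Adam with the beta / weight-decay values reported above; lr=1e-4 is an assumed
    # initial value, decayed to 10% of its value after 15 epochs.
    optimizer = torch.optim.Adam(params, lr=lr, betas=(0.9, 0.999), weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)
    return optimizer, scheduler
```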
In this subsection, our method is evaluated from both qualitative and quantitative points of view on the KITTI and Make3D datasets, and odometry results are further evaluated on the KITTI odometry dataset. The results show that our proposed framework achieves SOTA performance and outperforms recent algorithms on the depth estimation task.
We compare the performance of the proposed framework with the baseline as well as existing SOTA methods, as shown in Table I. The results show that our method achieves significant gains over all existing SOTA self-supervised approaches when trained with different types of data: KITTI monocular frames only, KITTI monocular frames and stereo pairs, and CityScapes plus KITTI monocular frames.
We summarize the main results in Table I as follows. (1) Overall, our method outperforms the previous SOTA under the same training setting. Although trained in a self-supervised manner, our method competes quite favorably with most supervised baselines. (2) Our KITTI monocular and CityScapes+KITTI model results are slightly lower than  on the Sq Rel and RMSE metrics, and the high-resolution monocular-stereo model obtains the second-best Abs Rel of 0.101, only 0.001 behind the result in . However, it should be mentioned that [3, 38] use a new auxiliary supervision signal, while we only use CityScapes and KITTI raw data. (3) For KITTI monocular training, our method is slightly better than , which is trained by a ConvLSTM-based network with video inputs. Compared with it, we improve the mainstream framework and our method is much simpler and more efficient. (4) It is worth mentioning that our method outperforms recent work [7, 32, 42, 47] that jointly learns multiple tasks with complex network structures. (5) Moreover, the experimental results show that stereo views, CityScapes pre-training, and high-resolution images can all improve the performance of the monocular depth estimation model.
As shown in Table II, we directly compare the proposed method with existing methods on the KITTI improved ground truth from . The improved depth is available for 652 of the 697 test frames contained in . The predicted depth maps are clipped to 80 meters and then the full maps are evaluated. The values of the existing methods are reported by . Our method is still significantly better than existing published methods without retraining.
Qualitative results are shown in Fig. 2, where some comparison samples between our KITTI monocular method and some self-supervised baselines are presented. As shown in the first image, compared with other methods, our method provides a clearer motion contour. It also perceives the geometry of static objects and results in a more reasonable depth estimation. Moreover, the depth difference between static overlapping objects can be distinguished significantly. In order to comprehensively visualize the performance of the proposed method, more qualitative results in different cases on the KITTI dataset are shown in Fig. 3.
| Method | Train | Abs Rel | Sq Rel | RMSE |
| --- | --- | --- | --- | --- |
| Zhou et al. | M | 0.383 | 5.321 | 10.470 |
| Bian et al. | M | 0.312 |  |  |
| Zhou et al. | M | 0.318 | 2.288 | 6.669 |
In Table III, we directly evaluate our method on the Make3D dataset without using any of its training data. Our model is trained on KITTI monocular video without any fine-tuning. Following the evaluation protocol in , only the central image crops where depth is less than 70 meters are evaluated. Our result outperforms existing SOTA methods that do not use depth supervision, showing excellent cross-dataset generalization ability.
Indeed, our method cannot be directly applied to the Make3D dataset because the data is discrete and does not have the inherent dynamic properties of video sequences. To adapt our method to this dataset, we resize the test images to 192×640 resolution and replicate each sample to simulate adjacent frames for evaluation. The results clearly demonstrate that the static depth cues in the shallow space are helpful for depth estimation.
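The adjacent-frame simulation used for discrete test images amounts to a simple duplication before feeding the six-channel USEN, for example:

```python
import torch

def simulate_adjacent_frames(image):
    """For discrete test images (e.g., Make3D), duplicate the frame so the
    six-channel unit-stream encoder still receives an 'adjacent' pair."""
    return torch.cat([image, image], dim=1)  # (B, 3, H, W) -> (B, 6, H, W)
```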
Fig. 4 shows some qualitative results of the Make3D dataset, which are estimated by our CS+K model. Both quantitative and qualitative experiments demonstrate the generalization ability of our method in estimating accurate depth maps from consecutive frames.
| Method | Sequence 09 | Sequence 10 | # frames |
| --- | --- | --- | --- |
| Zhou et al. | 0.021±0.017 | 0.020±0.015 | 5 |
| Zhou et al. | 0.015±0.007 | 0.015±0.009 | 3 |
For completeness, we evaluate the two-frame model on five-frame test sequences, combining the four frame-to-frame transformations in each group to form a local trajectory. We measure the absolute trajectory error averaged over every 5-frame snippet on sequences 09 and 10. The pose estimation results are summarized in Table IV. Although our method does not exceed the SOTA, it still achieves satisfactory performance; the main advantage of our method lies in the depth estimation task.
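The local trajectories used for this 5-frame evaluation are obtained by chaining frame-to-frame transforms; a minimal sketch (assuming 4×4 homogeneous matrices) is:

```python
import numpy as np

def accumulate_trajectory(rel_poses):
    """Chain frame-to-frame 4x4 transforms into a local trajectory for the
    5-frame ATE evaluation; the first pose is taken as the origin."""
    poses = [np.eye(4)]
    for T in rel_poses:          # four frame-to-frame transforms -> five poses
        poses.append(poses[-1] @ T)
    return poses
```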
To analyze the individual effect of each component in our framework, we first perform ablation studies on KITTI and CityScapes by replacing various components. Then our modules are applied to other methods to evaluate their generalization ability. Finally, we experiment on images with different resolutions.
| Train | Methods | Abs Rel | Sq Rel | RMSE | RMSE log | δ<1.25 | δ<1.25² | δ<1.25³ |
For the ablation studies, we use the baseline  and images of 192×640 resolution. As shown in Table V, the results demonstrate that the proposed modules provide benefits from different perspectives. Compared with IDCE, HAM achieves better performance. We hypothesize that the noise in the unit stream has a great impact on the self-supervised depth estimation task, and the global transformation obtained by HAM can effectively reduce this uncertainty. To verify this point, we compute statistics over the batch and channel dimensions of the feature processed by HAM. As shown in Fig. 5, the visualized plane becomes smoother after HAM processing. On this basis, we add the IDCE module to transfer implicit depth cues from the unit stream to the DepthNet; the depth cues serve as a dynamic supplement for the DepthNet. When the two modules are combined, our proposed method achieves SOTA results.
Fig. 6 shows a comparison of the depth maps at object boundaries, including two dynamic scenes and two static scenes. Compared with the baseline, IDCE effectively reduces motion blur by combining depth information between adjacent frames and provides clearer contours for both dynamic and static objects.
| Pose network | Attention module | Train | Abs Rel | Sq Rel | RMSE | RMSE log | δ<1.25 | δ<1.25² | δ<1.25³ |
| Train | Resolution | Abs Rel | Sq Rel | RMSE | RMSE log | δ<1.25 | δ<1.25² | δ<1.25³ |
The generalization ability of HAM is evaluated by adopting different PoseNets. Table VI shows the benefit that our HAM brings to the ResPose CNN  and Pose CNN , while a basic attention module such as CAM yields negative gains on the Pose CNN. We conjecture that this is due to the noisy unit-stream space provided by the Pose CNN: basic attention is not conducive to capturing reasonable geometric transformations from this sophisticated shallow space, whereas HAM produces more robust results.
Table VII shows the results of different training settings on images with different resolutions. High-resolution images improve performance but increase training time: it takes approximately 9 hours to train the K (M) (128×416) model, while the (320×1024) model takes about 49 hours.
To address the depth discontinuity and motion artifact problems, a novel self-supervised joint learning framework is proposed. Our main idea is to take advantage of the unit stream, which represents the spatial and temporal information in consecutive frames. The proposed framework utilizes the implicit depth cue extractor to extract static and dynamic depth cues from the unit stream in the shallow space, and uses these implicit cues to guide the depth estimation of a single image. Moreover, a high-dimensional attention module is introduced to extract global pose information, effectively reducing the appearance loss. Extensive experimental results demonstrate that our method outperforms SOTA methods on the KITTI and Make3D datasets by a significant margin, and the framework can be generalized to other self-supervised monocular depth estimation networks. For future work, it is worthwhile to explore a more accurate visual odometry based on this framework.
SynDeMo: synergistic deep feature alignment for joint learning of depth and ego-motion. In ICCV.
Depth prediction without the sensors: leveraging structure for unsupervised learning from monocular videos. In AAAI, pp. 8001–8008.
Self-supervised learning with geometric constraints in monocular video: connecting flow, depth, and camera. In ICCV.
The Cityscapes dataset for semantic urban scene understanding. In CVPR.
In ICCV 2015, Santiago, Chile, December 7–13, 2015, pp. 2650–2658.