Deep feature fusion for self-supervised monocular depth prediction

05/16/2020 ∙ by Vinay Kaushik, et al. ∙ 27

Recent advances in end-to-end unsupervised learning have significantly improved the performance of monocular depth prediction and alleviated the requirement of ground truth depth. Although a plethora of work has been done in enforcing various structural constraints by incorporating multiple losses utilising smoothness, left-right consistency, regularisation and matching surface normals, few of them take into consideration the multi-scale structures present in real-world images. Most works utilise a VGG16 or ResNet50 model pre-trained on ImageNet weights for predicting depth. We propose a deep feature fusion method utilising features at multiple scales for learning self-supervised depth from scratch. Our fusion network selects features from both upper and lower levels at every level in the encoder network, thereby creating multiple feature pyramid sub-networks that are fed to the decoder after applying the CoordConv solution. We also propose a refinement module that learns higher scale residual depth from a combination of higher level deep features and lower level residual depth, using a pixel shuffling framework that super-resolves the lower level residual depth. We select the KITTI dataset for evaluation and show that our proposed architecture can produce better or comparable results in depth prediction.




1 Introduction

Depth estimation is a fundamental problem of computer vision, with a wide range of applications in areas such as robotics, augmented reality, action recognition, human pose estimation and scene interaction. Traditional depth prediction algorithms rely on certain assumptions, such as the availability of multi-view camera data, structure from motion or stereopsis, for efficiently estimating scene depth. To remove this requirement, recent deep learning approaches pose depth prediction as a supervised learning problem. These methods depend on large collections of ground truth depth information for training a deep learning model to predict per-pixel depth. While these methods work well, there are many scenarios where ground truth depth data is scarce or impossible to obtain.

Self-supervised depth prediction solves the depth prediction problem by recasting it as image view synthesis. Garg et al.[6] proposed a deep network that enforced a photometric constraint on stereo image data for predicting depth. DVSO [17] uses sparse depth computed from a visual odometry pipeline as a supervisory signal for improving depth prediction. Although a lot of work has been done on enforcing multiple constraints on learning depth (surface normal constraints, left-right consistency constraints, etc.) by constructing a loss for each, not much has been done to change the encoder-decoder architecture itself, which has to actually learn those constraints. The most crucial architectural requirement is capturing the multi-scale structural information present in the scene.

In this work, we pose depth prediction as an unsupervised problem, where our model learns to predict multi-scale depth by learning pixel-level correspondence between rectified pairs of stereo images with a known baseline. Our work optimises the depth prediction pipeline with a novel deep learning framework whose architectural improvements over existing methods yield better accuracy and qualitatively better results. In summary, we make the following contributions: (i) a self-supervised learning architecture utilising multiple feature pyramid sub-networks for feature fusion at every feature scale; (ii) use of the CoordConv solution[14], concatenating coordinate channels centred at the principal point in every FPN sub-network and along every skip connection; (iii) a residual refinement module that leverages lower resolution residual depth through a residual sub-network, incorporating pixel shuffling for super-resolving predicted depth; (iv) state-of-the-art performance on the KITTI driving dataset[7], with improved visual quality of the estimated depth maps.

2 Related Work

Predicting depth from images has always been a crucial task in computer vision. The majority of traditional approaches use stereo or multi-view images to predict depth. The advent of deep learning led to the collection of ground truth depth with expensive scanners, posing depth prediction as a supervised learning problem. We target the specific domain of monocular depth prediction and pose it as a self-supervised learning problem where, given only a single image as input, we aim to predict its depth without considering any scene prior.

2.1 Supervised Depth Prediction

Laina et al.[12] introduced ResNet based FCN architecture for predicting supervised depth. Hu et al.[10] constructed a multi-scale feature fusion module to produce better depth at object edges. Chen et al.[2] formulated a structure-aware residual pyramid network for learning supervised depth.

2.2 Self-supervised Monocular Depth Prediction

Garg et al.[6] presented the problem of predicting depth as a learning problem with depth prediction as an intermediate step of image synthesis. The photometric error is computed to train the deep network. Godard et al. [8] proposed that the depth produced by both left and right images must be consistent with each other and constructed a left-right consistency loss enforcing the same. They also utilised an effective weighted combination of SSIM and L1 loss along with a smoothness constraint to further optimise the predicted depth. Pillai et al. [15] introduced depth super-resolution by using a sub-pixel convolution layer, thereby improving depth at higher resolutions. UnDispNet [11] provided a cascaded residual framework utilising super-resolution for refining depth by utilising multiple autoencoders.

Almalioglu et al.[1] utilised a generative adversarial network by treating the depth encoder as a generator network and feeding the synthesised image to the discriminator network. Feng et al.[5] constructed a stack of GAN layers, with higher layers estimating spatial features and lower layers learning depth and camera pose. Zhou et al.[19] presented an algorithm to learn depth from monocular video. Their framework comprised separate depth and pose networks, with the loss constructed by warping temporal views using the camera pose combined with the predicted depth of the target image. GeoNet[18] resolved texture ambiguities by incorporating an adaptive geometric consistency loss. Godard et al.[9] proposed a shared encoder framework for predicting depth and pose from monocular videos. They also introduced a minimum photometric error loss that learnt optimal depth at every pixel from a set of temporal frames. UnDeepVO[13] combined both temporal and spatial constraints of photometric consistency and depth consistency to create a single framework for egomotion and depth estimation. GLNet[3] designed a loss to learn depth only from static regions of the image. GLNet also proposed an online refinement strategy that further improved depth prediction and thereby provided an efficient way to better generalise the learnt depth to novel datasets.

Figure 2: Qualitative results on the KITTI driving dataset 2015[7]
| Method | Dataset | Abs Rel | Sq Rel | RMSE | RMSE log | D1-all | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|---|---|
| Deep3D[16] | K | 0.412 | 16.37 | 13.693 | 0.512 | 66.85 | 0.69 | 0.833 | 0.891 |
| Deep3Ds[16] | K | 0.151 | 1.312 | 6.344 | 0.239 | 59.64 | 0.781 | 0.931 | 0.976 |
| Godard[8] pp | K | 0.124 | 1.388 | 6.125 | 0.217 | 30.272 | 0.841 | 0.936 | 0.975 |
| UnDispNet[11] pp | K | 0.110 | 1.059 | 5.571 | 0.195 | 28.656 | 0.858 | 0.947 | 0.980 |
| Ours no pp | K | 0.107 | 1.056 | 5.370 | 0.188 | 27.379 | 0.866 | 0.955 | 0.983 |
| Ours no fusion | K | 0.108 | 1.098 | 5.425 | 0.187 | 27.218 | 0.869 | 0.953 | 0.981 |
| Ours no CoordConv | K | 0.106 | 1.082 | 5.371 | 0.185 | 26.727 | 0.870 | 0.955 | 0.983 |
| Ours | K | 0.104 | 1.022 | 5.290 | 0.185 | 26.163 | 0.871 | 0.956 | 0.984 |

Lower is better: Abs Rel, Sq Rel, RMSE, RMSE log, D1-all. Higher is better: δ thresholds.

Table 1: Results on the KITTI 2015 Stereo 200 training set disparity images[7]. For training, K is the KITTI dataset[7]. pp is the disparity post-processing specified by [8].

3 Methodology

This section describes our self-supervised depth prediction framework in detail. We introduce a deep feature fusion method that combines features at multiple scales to compute stereo depth without requiring any supervision or ground truth depth. We apply the CoordConv solution[14] at critical places in our network to improve its scale awareness. We also describe the residual refinement module for learning higher scale residual depth.

3.1 Depth Prediction as View Reconstruction

Our network treats the prediction of fine-grained depth as learning a dense correspondence field that is applied to the source image to reconstruct the target image of the scene. In a stereo setup, the source and target images are the left and right images respectively. Our network takes the left image $I^l$ as input and predicts depth $d^l$. Given the left depth $d^l$ and the right image $I^r$, we can generate $\tilde{I}^l$ as the reconstructed left image. The same can be done with the right depth $d^r$ as input. We can simply compute $D = bf/d$ as the metric depth, given baseline $b$ and focal length $f$. We compute the depth of both left and right images at 4 scales for training our network.
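The reconstruction and depth conversion described above can be illustrated on a single scanline. This is a minimal sketch in plain Python, not the paper's implementation: it assumes the predicted quantity is a per-pixel horizontal disparity in pixels, that a left pixel at column x is sampled from the right image at column x - d(x), and that sampling uses linear interpolation with border clamping. The function names, the baseline of 0.54 m and the focal length of 720 px are illustrative assumptions.

```python
def disparity_to_depth(disparity_px, baseline_m, focal_px):
    """Metric depth from disparity for a rectified stereo pair: D = b * f / d."""
    return baseline_m * focal_px / disparity_px


def reconstruct_left_row(right_row, left_disp_row):
    """Rebuild one left-image scanline by sampling the right scanline.

    Left pixel x is read from right position x - d(x), using linear
    interpolation between neighbours and clamping at the image border.
    """
    width = len(right_row)
    out = []
    for x, d in enumerate(left_disp_row):
        src = min(max(x - d, 0.0), width - 1.0)  # clamp to the valid range
        x0 = int(src)                            # left neighbour
        x1 = min(x0 + 1, width - 1)              # right neighbour
        w = src - x0                             # interpolation weight
        out.append((1.0 - w) * right_row[x0] + w * right_row[x1])
    return out


# Toy scanline: intensities of the right image and a left disparity map.
right = [10.0, 20.0, 30.0, 40.0, 50.0]
disp = [0.0, 1.0, 1.0, 1.5, 2.0]
left_hat = reconstruct_left_row(right, disp)

# Illustrative camera values (assumptions, not taken from the paper).
depth_m = disparity_to_depth(10.0, baseline_m=0.54, focal_px=720.0)
```

The photometric difference between `left_hat` and the true left scanline is what drives training in this family of methods; no ground truth depth enters the computation.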

3.2 Feature Fusion Network

The core of our architecture is the Feature Fusion Network. Our pyramidal encoder utilises a combination of high-level and lower-level features, learning global image structure as well as the intricate shape details present in the scene. Inspired by [2], our fusion network takes features at multiple scales as input, fuses them based on their neighbourhood and generates a richer set of features, as shown in Fig. 1. Unlike [2], we do not reshape and feed all the features at all scales; instead, we give the feature at a given scale more importance than the features at lower or higher scales by reserving more channels for it. The encoder extracts a set of $L$ features $F_i$, where $i$ is the feature level. The spatial size of the feature halves at every level as the level increases, due to convolution with stride 2. At every level $i$, our feature fusion network computes a fused feature, taking as input a combination of upsampled features, where each input is the encoded feature at a neighbouring level $j$ reshaped to the feature size at level $i$. A convolution operation with stride 1 is applied over the set of fused features to generate a set of richer features at every scale. These features form a new set of encoded features that are then forwarded to the CoordConv block before being sent to the decoder module.
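The neighbourhood fusion described above can be sketched as follows. This is a hypothetical illustration rather than the authors' network: it assumes each level is fused only with its immediate finer and coarser neighbours, that the finer level is brought to the target size by 2x2 average pooling and the coarser one by nearest-neighbour upsampling, and that fusion is a plain channel concatenation. The stride-1 convolution applied afterwards, and the channel budget that favours the current scale, are omitted. Features are nested lists of shape [C][H][W].

```python
def downsample2(feat):
    """Halve H and W of a [C][H][W] feature by 2x2 average pooling."""
    return [[[(ch[2 * y][2 * x] + ch[2 * y][2 * x + 1]
               + ch[2 * y + 1][2 * x] + ch[2 * y + 1][2 * x + 1]) / 4.0
              for x in range(len(ch[0]) // 2)]
             for y in range(len(ch) // 2)]
            for ch in feat]


def upsample2(feat):
    """Double H and W of a [C][H][W] feature by nearest-neighbour repetition."""
    return [[[ch[y // 2][x // 2] for x in range(2 * len(ch[0]))]
             for y in range(2 * len(ch))]
            for ch in feat]


def fuse_level(features, i):
    """Concatenate level i with its resized neighbours along the channel axis."""
    fused = []
    if i > 0:
        fused += downsample2(features[i - 1])  # finer level, reduced to level i
    fused += features[i]                       # the level's own feature
    if i + 1 < len(features):
        fused += upsample2(features[i + 1])    # coarser level, enlarged to level i
    return fused


def make_feature(channels, size, value):
    """Constant-valued [C][size][size] feature, used only for the demo."""
    return [[[value] * size for _ in range(size)] for _ in range(channels)]


# Three-level toy pyramid: resolution halves as the level index grows.
pyramid = [make_feature(1, 8, 1.0), make_feature(2, 4, 2.0), make_feature(3, 2, 3.0)]
fused = [fuse_level(pyramid, i) for i in range(len(pyramid))]
shapes = [(len(f), len(f[0]), len(f[0][0])) for f in fused]
```

Every fused feature keeps the spatial size of its own level while gaining channels from its neighbours, which is what lets a following stride-1 convolution mix information across scales.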

3.3 The CoordConv Solution

The CoordConv block[14] creates three additional channels for every set of features it receives. These coordinate channels are simply concatenated to the incoming representations. They contain hard-coded coordinates: one channel for the $x$ coordinate, one for the $y$ coordinate and one for the polar (radius) coordinate, defined as $r = \sqrt{x^2 + y^2}$. These channels are created for every feature set before it is fed to the feature fusion network. In addition, as a standard practice, these blocks are applied to the skip connections in the fusion encoder-decoder module.
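As an illustration of the three coordinate channels, the sketch below builds x, y and radius maps for a feature of a given size and appends them to it. It assumes coordinates are measured from a centre point (cx, cy), standing in for the principal point, and normalised by half of the larger image side; that normalisation and the function names are assumptions made for the example, not details stated in the text.

```python
import math


def coord_channels(height, width, cx, cy):
    """Return x, y and radius channels, each [H][W], centred at (cx, cy)."""
    scale = max(width, height) / 2.0  # assumed normalisation factor
    xs = [[(x - cx) / scale for x in range(width)] for _ in range(height)]
    ys = [[(y - cy) / scale for _ in range(width)] for y in range(height)]
    rs = [[math.hypot(xs[y][x], ys[y][x]) for x in range(width)]
          for y in range(height)]
    return xs, ys, rs


def add_coords(feat, cx, cy):
    """Concatenate the three coordinate channels to a [C][H][W] feature."""
    height, width = len(feat[0]), len(feat[0][0])
    return feat + list(coord_channels(height, width, cx, cy))


# Two-channel 3x5 feature with the centre placed at column 2, row 1.
feature = [[[0.0] * 5 for _ in range(3)] for _ in range(2)]
with_coords = add_coords(feature, cx=2, cy=1)
```

Because the channels are fixed functions of pixel position, they add no parameters of their own; the following convolution simply sees where in the image each activation lies.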

3.4 Residual Refinement Module

Our residual refinement module takes as input the fused features generated by the convolutional decoder, which are fed to a set of convolutional layers to compute residual depth. This residual depth is then combined with the predicted lower resolution depth, which is upsampled using super-resolution similar to [15]. The fused depth is then sent to a set of convolutional layers with stride 1 for predicting refined depth. This super-resolved residual architecture lets our model recover structurally rich, intricate details and refine scene structures while preserving the global image layout. We use pixel shuffling after convolving the feature with a sequence of layers having 32, 32, 16 and 4 channels to super-resolve depth to twice the input resolution.
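The pixel shuffling step can be shown in isolation: a 4-channel map at resolution H x W is rearranged into a 1-channel map at 2H x 2W. The sketch below is a plain-Python version of this rearrangement with the channel ordering commonly used for sub-pixel convolution (output pixel (y, x) of channel c reads input channel c*r*r + (y mod r)*r + (x mod r)). The convolutions that produce the 4 channels are not modelled, and the function name is an assumption.

```python
def pixel_shuffle(feat, r):
    """Rearrange a [C*r*r][H][W] feature into [C][H*r][W*r]."""
    c_out = len(feat) // (r * r)
    height, width = len(feat[0]), len(feat[0][0])
    out = [[[0.0] * (width * r) for _ in range(height * r)] for _ in range(c_out)]
    for c in range(c_out):
        for y in range(height * r):
            for x in range(width * r):
                # Each r x r output block gathers one value from each of r*r channels.
                src_channel = c * r * r + (y % r) * r + (x % r)
                out[c][y][x] = feat[src_channel][y // r][x // r]
    return out


# Four channels of size 1x2; channel k holds the values 10*k and 10*k + 1.
low_res = [[[10 * k, 10 * k + 1]] for k in range(4)]
high_res = pixel_shuffle(low_res, r=2)
```

With r = 2, the 4 output channels of the last convolution are exactly what is needed to fill each 2x2 block of a single-channel depth map at twice the resolution.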

| Method | Resolution | Dataset | Abs Rel | Sq Rel | RMSE | RMSE log | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|---|---|
| Garg[6] cap 50m | 620 x 188 | K | 0.169 | 1.080 | 5.104 | 0.273 | 0.740 | 0.904 | 0.962 |
| Godard[9] | 640 x 192 | K | 0.115 | 1.1010 | 5.164 | 0.212 | 0.858 | 0.946 | 0.974 |
| GeoNet[18] | 416 x 128 | K | 0.155 | 1.296 | 5.857 | 0.233 | 0.793 | 0.931 | 0.973 |
| UnDeepVO[13] | 416 x 128 | K | 0.183 | 1.730 | 6.57 | 0.268 | - | - | - |
| Godard[8] | 640 x 192 | K | 0.148 | 1.344 | 5.927 | 0.247 | 0.803 | 0.922 | 0.964 |
| GANVO[1] | 416 x 128 | K | 0.150 | 0.1141 | 5.448 | 0.216 | 0.808 | 0.937 | 0.975 |
| GLNet[3] | 416 x 128 | K | 0.135 | 1.070 | 5.230 | 0.210 | 0.841 | 0.948 | 0.980 |
| SuperDepth[15] | 1024 x 384 | K | 0.112 | 0.875 | 4.958 | 0.207 | 0.852 | 0.947 | 0.977 |
| UnDispNet[11] | 1024 x 384 | K | 0.110 | 0.892 | 4.895 | 0.206 | 0.868 | 0.951 | 0.976 |
| SGANVO[5] | 416 x 128 | K | 0.065 | 0.673 | 4.003 | 0.136 | 0.944 | 0.979 | 0.991 |
| Ours | 1024 x 384 | K | 0.960 | 0.851 | 4.386 | 0.179 | 0.878 | 0.962 | 0.984 |

Table 2: Self-supervised depth estimation results on the KITTI dataset [7] using the Eigen split [4] for depths capped at 80m, as described in [4].

3.5 Loss Function

Our network predicts depth $d^l$ at 4 scales for the left input image $I^l$. Similarly, it predicts the right depth $d^r$. Given the multi-scale depth of the left and right images, we compute an appearance matching loss, a disparity smoothness loss and a left-right consistency loss[8], together with occlusion regularization[17]. Our losses are computed at 4 scales with the standard weight factor at every scale[8]. Our architecture learns a combination of these multi-scale losses in an end-to-end manner.
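To make the individual loss terms concrete, here is a simplified single-scale, single-scanline sketch. It is not the training loss used in the paper: the appearance term keeps only the L1 part (the SSIM part of [8] is left out), occlusion regularization is omitted, left-right consistency uses nearest-neighbour sampling, and the weights `w_smooth` and `w_lr` are placeholder assumptions.

```python
import math


def l1_appearance(image_row, recon_row):
    """Mean absolute photometric error between an image row and its reconstruction."""
    return sum(abs(a - b) for a, b in zip(image_row, recon_row)) / len(image_row)


def edge_aware_smoothness(disp_row, image_row):
    """Mean disparity gradient, down-weighted where the image itself has an edge."""
    terms = [abs(disp_row[x + 1] - disp_row[x])
             * math.exp(-abs(image_row[x + 1] - image_row[x]))
             for x in range(len(disp_row) - 1)]
    return sum(terms) / len(terms)


def left_right_consistency(left_disp_row, right_disp_row):
    """Mean |d_l(x) - d_r(x - d_l(x))| with nearest-neighbour sampling."""
    width = len(left_disp_row)
    total = 0.0
    for x, d in enumerate(left_disp_row):
        src = min(max(int(round(x - d)), 0), width - 1)  # clamp to the row
        total += abs(d - right_disp_row[src])
    return total / width


def total_loss(appearance, smoothness, consistency, w_smooth=0.1, w_lr=1.0):
    """Weighted sum of the three terms; the weights are illustrative only."""
    return appearance + w_smooth * smoothness + w_lr * consistency


app = l1_appearance([1, 1, 5, 5], [1, 2, 5, 4])
smooth = edge_aware_smoothness([1, 1, 2, 2], [0, 0, 0, 0])
lr = left_right_consistency([0, 1, 1, 1], [0, 2, 1, 1])
loss = total_loss(app, smooth, lr)
```

In a multi-scale setup the same terms would be evaluated at each of the 4 output scales and summed with per-scale weights.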

4 Experiments

We evaluate our method on the KITTI dataset[7] using both the KITTI and Eigen [4] train-test splits for fair comparison. Our network is trained from scratch with input images at 1024x384 resolution. D1-all and the depth metrics from [4],[7] are used for comparison.
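For reference, the depth metrics reported in the tables follow the standard definitions introduced in [4]. The sketch below computes them in plain Python for flat lists of ground truth and predicted depths in metres; the function name, the dictionary keys and the validity handling (ground truth kept only inside the depth range, predictions clamped to it) are assumptions about a typical evaluation script rather than the exact protocol used here.

```python
import math


def depth_metrics(gt, pred, min_depth=1e-3, max_depth=80.0):
    """Abs Rel, Sq Rel, RMSE, RMSE log and threshold accuracies for valid pixels."""
    pairs = [(g, min(max(p, min_depth), max_depth))
             for g, p in zip(gt, pred) if min_depth < g < max_depth]
    n = len(pairs)
    ratios = [max(g / p, p / g) for g, p in pairs]
    return {
        "abs_rel": sum(abs(g - p) / g for g, p in pairs) / n,
        "sq_rel": sum((g - p) ** 2 / g for g, p in pairs) / n,
        "rmse": math.sqrt(sum((g - p) ** 2 for g, p in pairs) / n),
        "rmse_log": math.sqrt(
            sum((math.log(g) - math.log(p)) ** 2 for g, p in pairs) / n),
        # Fraction of pixels whose ratio max(gt/pred, pred/gt) is below 1.25^k.
        "a1": sum(r < 1.25 for r in ratios) / n,
        "a2": sum(r < 1.25 ** 2 for r in ratios) / n,
        "a3": sum(r < 1.25 ** 3 for r in ratios) / n,
    }


# The last ground truth value lies beyond the 80 m cap and is ignored.
metrics = depth_metrics(gt=[10.0, 20.0, 40.0, 100.0], pred=[10.0, 24.0, 30.0, 90.0])
```

The four error metrics are lower-is-better, while the three threshold accuracies are higher-is-better, matching the column groups of Tables 1 and 2.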

4.1 Implementation Details

Our base architecture consists of a VGG14 model with skip connections. We train our model using the Adam optimizer for 70 epochs, where the first 50 epochs are trained on all 4 scales, the next 10 epochs on 2 scales, and the last 10 without the regularization and smoothness losses for fine-tuning. Inference takes 30 ms on an NVIDIA RTX 2080Ti GPU. Our network has only 27 million parameters, compared to a basic ResNet50 with 44 million parameters [8] or a stacked module containing 63 million parameters [17].
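The three-stage training schedule can be written down as a small lookup. This is only a sketch of the schedule as stated: the function name and dictionary keys are hypothetical, epochs are assumed to be 0-indexed, and the number of scales used in the final fine-tuning stage is not given in the text, so keeping it at 2 is an assumption.

```python
def stage_for_epoch(epoch):
    """Return the loss configuration for a 0-based epoch of the 70-epoch run."""
    if not 0 <= epoch < 70:
        raise ValueError("epoch must lie in [0, 70)")
    if epoch < 50:
        # Stage 1: all losses on all 4 scales.
        return {"scales": 4, "smoothness": True, "regularization": True}
    if epoch < 60:
        # Stage 2: all losses on 2 scales.
        return {"scales": 2, "smoothness": True, "regularization": True}
    # Stage 3: fine-tuning without smoothness and regularization losses.
    # The scale count here is an assumption; the text does not state it.
    return {"scales": 2, "smoothness": False, "regularization": False}


schedule = [stage_for_epoch(e) for e in range(70)]
```

A training loop would consult `stage_for_epoch` at the start of each epoch to decide which loss terms and output scales contribute to the objective.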

4.2 KITTI split

We tested our method on the 200 stereo images provided by the KITTI 2015 Stereo Dataset. Our model performs well compared with other methods, as shown in Table 1. Our architecture outperforms [8], which indicates its effectiveness in predicting depth. Feature fusion helps in preserving structure and learning refined depth. Super-resolution ensures that the learnt residual depth is not made inconsistent by sub-optimal sampling. We also observe that feature fusion significantly improves depth prediction, and that CoordConv gives a further slight performance boost in depth evaluation.

4.3 Eigen Split

As shown in Table 2, our method performs better than the rest of the methods that exploit geometric constraints for self-supervised depth prediction. Our method produces visually rich results, as shown in Fig. 2. SGANVO[5], which utilises stacked generators with adversarial losses, performs better but predicts depth at a smaller resolution. Our method predicts accurate depth estimates and could in future also utilise GANs for further optimisation. We show that the architecture plays as important a role in learning better depth as having better constraints enforced through the training loss. In line with this, we observe that our model performs significantly better than other methods trained on similar losses[8, 15, 11, 13]. Compared with the others, our method generates clearer textures and more fine-grained details in the predicted depth.

5 Conclusion

In this work, we propose a deep feature fusion based architecture that leverages multiple feature pyramids as sub-networks for an optimal encoder network. We combine encoded features with the CoordConv solution, thereby learning robust invariant features, which are refined by a residual decoder that incorporates depth super-resolution for learning fine-grained depth. Our model predicts accurate depth at a higher resolution than other methods. In future, we would like to use an adversarial training scheme along with a separate pose network to enable learning from monocular video and further improve performance.


  • [1] Y. Almalioglu, M. R. U. Saputra, P. P. de Gusmao, A. Markham, and N. Trigoni (2019) GANVO: unsupervised deep monocular visual odometry and depth estimation with generative adversarial networks. In 2019 International Conference on Robotics and Automation (ICRA), pp. 5474–5480.
  • [2] X. Chen, X. Chen, and Z. Zha (2019) Structure-aware residual pyramid network for monocular depth estimation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 694–700.
  • [3] Y. Chen, C. Schmid, and C. Sminchisescu (2019) Self-supervised learning with geometric constraints in monocular video: connecting flow, depth, and camera. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7063–7072.
  • [4] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pp. 2366–2374.
  • [5] T. Feng and D. Gu (2019) SGANVO: unsupervised deep visual odometry and depth estimation with stacked generative adversarial networks. IEEE Robotics and Automation Letters 4 (4), pp. 4431–4437.
  • [6] R. Garg, V. K. BG, G. Carneiro, and I. Reid (2016) Unsupervised CNN for single view depth estimation: geometry to the rescue. In European Conference on Computer Vision, pp. 740–756.
  • [7] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the KITTI dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237.
  • [8] C. Godard, O. Mac Aodha, and G. J. Brostow (2017) Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270–279.
  • [9] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow (2019) Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3828–3838.
  • [10] J. Hu, M. Ozay, Y. Zhang, and T. Okatani (2019) Revisiting single image depth estimation: toward higher resolution maps with accurate object boundaries. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1043–1051.
  • [11] V. Kaushik and B. Lall (2019) UnDispNet: unsupervised learning for multi-stage monocular depth prediction. In 2019 International Conference on 3D Vision (3DV), pp. 633–642.
  • [12] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab (2016) Deeper depth prediction with fully convolutional residual networks. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 239–248.
  • [13] R. Li, S. Wang, Z. Long, and D. Gu (2018) UnDeepVO: monocular visual odometry through unsupervised deep learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7286–7291.
  • [14] R. Liu, J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev, and J. Yosinski (2018) An intriguing failing of convolutional neural networks and the CoordConv solution. In Advances in Neural Information Processing Systems, pp. 9605–9616.
  • [15] S. Pillai, R. Ambruş, and A. Gaidon (2019) SuperDepth: self-supervised, super-resolved monocular depth estimation. In 2019 International Conference on Robotics and Automation (ICRA), pp. 9250–9256.
  • [16] J. Xie, R. Girshick, and A. Farhadi (2016) Deep3D: fully automatic 2D-to-3D video conversion with deep convolutional neural networks. In European Conference on Computer Vision, pp. 842–857.
  • [17] N. Yang, R. Wang, J. Stuckler, and D. Cremers (2018) Deep virtual stereo odometry: leveraging deep depth prediction for monocular direct sparse odometry. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 817–833.
  • [18] Z. Yin and J. Shi (2018) GeoNet: unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1983–1992.
  • [19] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1851–1858.