1 Previous Work
Scene depth estimation has been a long-standing problem in vision and robotics. Numerous approaches involving stereo or multi-view depth estimation exist. Recently, a learning-based concept for image-to-depth estimation has emerged, fueled by the availability of rich feature representations learned from raw data [Eigen, Puhrsch, and Fergus2014, Laina et al.2016]. These approaches have shown compelling results compared to traditional methods [Karsch, Liu, and Kang2014b]. Pioneering work in unsupervised image-to-depth learning was proposed by [Zhou et al.2017, Garg, Carneiro, and Reid2016], where no depth or ego-motion is needed as supervision. Many subsequent works have improved the initial results in both the monocular setting [Yang et al.2017, Yin2018] and when using stereo during training [Godard, Aodha, and Brostow2017, Ummenhofer et al.2017, Zhan et al.2018, Yang et al.2018a].
However, these methods still fall short in practice because they do not handle object movement: in highly dynamic scenes they tend to fail, as they cannot explain object motion. To that end, separately trained optical flow models have been used, with moderate improvements [Yin2018, Yang et al.2018b, Yang et al.2018a]. Our motion model is most closely aligned with these methods in that we similarly use a pre-trained model, but we propose to use the geometric structure of the scene and to model the motion of all objects, including camera ego-motion. The refinement method is related to prior work [Bloesch et al.2018], which uses lower-dimensional representations to fuse subsequent frames; our work shows that this can be done in the original space with very good quality.
2 Main Method
The main learning setup is unsupervised learning of depth and ego-motion from monocular video [Zhou et al.2017], where the only source of supervision is obtained from the video itself. We propose a novel approach that models dynamic scenes by modeling object motion, and that can optionally adapt its learning strategy with an online refinement technique. Note that the two ideas are independent and can be used either separately or jointly. We describe them individually, and demonstrate their individual and joint effectiveness in various experiments.
2.1 Problem Setup
The input to the method is a sequence of at least three RGB images $(I_1, I_2, I_3)$, together with the camera intrinsics matrix $K$ (we use three images for simplicity in all derivations below). Depth and ego-motion are predicted by learned nonlinear functions, i.e. neural networks. The depth function $\theta$ is a fully convolutional encoder-decoder architecture that produces a dense depth map $D_i = \theta(I_i)$ from a single RGB frame. The ego-motion network $\psi_E$ takes a sequence of two RGB images as input and produces the SE(3) transform between the frames, i.e. a 6-dimensional transformation vector $E_{1 \to 2} = \psi_E(I_1, I_2)$ of the form $(t_x, t_y, t_z, r_x, r_y, r_z)$, specifying translation and rotation parameters between the frames. Similarly, $E_{2 \to 3} = \psi_E(I_2, I_3)$. (For convenience, the ego-motion network is implemented to obtain both transformations simultaneously from the three RGB frames.)
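To make the 6-dimensional transformation vector concrete, the sketch below turns translation and rotation parameters into a 4x4 rigid transform. The text does not fix the rotation parameterization, so the Euler-angle convention (composed as Rz Ry Rx) and the helper names `matmul` and `se3_from_vector` are assumptions made purely for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def se3_from_vector(vec):
    """Build a 4x4 rigid transform from (tx, ty, tz, rx, ry, rz).

    Assumption: rx, ry, rz are Euler angles in radians, composed as Rz @ Ry @ Rx.
    """
    tx, ty, tz, rx, ry, rz = vec
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    rot_x = [[1, 0, 0], [0, cx, -sx], [0, sx, cx]]
    rot_y = [[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]]
    rot_z = [[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]]
    rot = matmul(rot_z, matmul(rot_y, rot_x))
    # Append the translation as the fourth column and add the homogeneous row.
    return [rot[0] + [tx], rot[1] + [ty], rot[2] + [tz], [0.0, 0.0, 0.0, 1.0]]

# Pure forward translation with a small sideways offset and no rotation.
transform = se3_from_vector([0.1, 0.0, 1.0, 0.0, 0.0, 0.0])
# A quarter turn about the z axis with no translation.
quarter_turn = se3_from_vector([0.0, 0.0, 0.0, 0.0, 0.0, math.pi / 2])
```

A network would regress the six numbers directly; the conversion only matters at the point where the transform is applied to 3D points.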
Using a warping operation from one image to an adjacent one in the sequence, we can imagine how a scene would look from a different camera viewpoint. Since the depth of the scene is available from the depth network, the ego-motion to the next frame can translate the scene to the next frame and obtain the next image by projection. More specifically, with a differentiable image warping operator $\phi(I_i, D_j, E_{i \to j}) \to \hat{I}_{i \to j}$, where $\hat{I}_{i \to j}$ is the reconstructed $j$-th image, we can warp any source RGB image $I_i$ into $I_j$ given the corresponding depth estimate $D_j$ and an ego-motion estimate $E_{i \to j}$. In practice, $\phi$ performs the warping by reading from transformed image pixel coordinates, setting $\hat{I}_{i \to j}^{xy} = I_i^{\hat{x}\hat{y}}$, where $[\hat{x}, \hat{y}, 1]^T = K E_{i \to j} (D_j^{xy} \cdot K^{-1} [x, y, 1]^T)$ are the projected coordinates and $K$ is the camera intrinsics matrix. The supervisory signal is then established using a photometric loss comparing the scene projected onto the next frame, $\hat{I}_{i \to j}$, with the actual next frame $I_j$ in RGB space, for example using a reconstruction loss: $L_{rec} = \|\hat{I}_{i \to j} - I_j\|$.
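The projection step described above can be illustrated for a single pixel. This is a minimal sketch assuming a skew-free pinhole camera given as `(fx, fy, cx, cy)` and a rigid transform given as a 4x4 list of rows; the function name `reproject_pixel` and all numeric values are hypothetical. A full warping operator would additionally sample the source image at the projected coordinates (for example bilinearly), which is omitted here.

```python
def reproject_pixel(x, y, depth, intrinsics, transform):
    """Project pixel (x, y) with the given depth into the other view.

    intrinsics: (fx, fy, cx, cy) of a pinhole camera (assumed, no skew).
    transform: 4x4 rigid transform as a list of rows.
    Returns the projected pixel coordinates (x_hat, y_hat).
    """
    fx, fy, cx, cy = intrinsics
    # Back-project to a 3D point: depth * K^-1 [x, y, 1]^T.
    point = [(x - cx) / fx * depth, (y - cy) / fy * depth, depth, 1.0]
    # Apply the rigid transform to the 3D point.
    moved = [sum(row[k] * point[k] for k in range(4)) for row in transform[:3]]
    # Project with K and divide by the new depth.
    x_hat = fx * moved[0] / moved[2] + cx
    y_hat = fy * moved[1] / moved[2] + cy
    return x_hat, y_hat

identity = [[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0], [0, 0, 0, 1.0]]
# The point ends up one unit closer to the camera along the optical axis.
forward = [[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, -1.0], [0, 0, 0, 1.0]]
intrinsics = (100.0, 100.0, 64.0, 32.0)

same = reproject_pixel(80.0, 40.0, 10.0, intrinsics, identity)
closer = reproject_pixel(80.0, 40.0, 10.0, intrinsics, forward)
```

With the identity transform the pixel maps back to itself, while the forward motion pushes it away from the principal point, as expected when approaching a static scene point.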
2.2 Algorithm Baseline
We establish a strong baseline for our algorithm by following best practices from recent work [Zhou et al.2017, Godard, Aodha, and Brostow2018]. The reconstruction loss is computed as the minimum reconstruction loss between warping from either the previous frame or the next frame into the middle one:
$$L_{rec} = \min\left(\|\hat{I}_{1 \to 2} - I_2\|, \|\hat{I}_{3 \to 2} - I_2\|\right),$$
where $\hat{I}_{1 \to 2}$ and $\hat{I}_{3 \to 2}$ denote the previous and next frames warped into the middle frame $I_2$. This was proposed by [Godard, Aodha, and Brostow2018] to avoid penalization due to significant occlusion/disocclusion effects. In addition to the reconstruction loss, the baseline uses an SSIM [Wang et al.2004] loss and a depth smoothness loss, and applies depth normalization during training, which demonstrated success in prior works [Zhou et al.2017, Godard, Aodha, and Brostow2017, Wang et al.2018]. The total loss is applied at multiple scales.
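The effect of taking the minimum over the two warps can be sketched on tiny grayscale images. The per-pixel minimum followed by a mean, the L1 error, and the function names are assumptions for illustration; the images are made-up values in which each warp is wrong at exactly one (occluded) pixel.

```python
def l1_error(warped, target):
    """Per-pixel absolute error between two grayscale images (lists of rows)."""
    return [[abs(w - t) for w, t in zip(wr, tr)] for wr, tr in zip(warped, target)]

def min_reconstruction_loss(warped_prev, warped_next, target):
    """Mean over pixels of the smaller of the two per-pixel errors."""
    err_prev = l1_error(warped_prev, target)
    err_next = l1_error(warped_next, target)
    per_pixel = [min(a, b)
                 for row_a, row_b in zip(err_prev, err_next)
                 for a, b in zip(row_a, row_b)]
    return sum(per_pixel) / len(per_pixel)

target = [[0.5, 0.5], [0.5, 0.5]]
warped_prev = [[0.5, 0.9], [0.5, 0.5]]   # pixel (0, 1) is occluded in this warp
warped_next = [[0.5, 0.5], [0.1, 0.5]]   # pixel (1, 0) is occluded in this warp

loss = min_reconstruction_loss(warped_prev, warped_next, target)
mean_prev_only = sum(sum(row) for row in l1_error(warped_prev, target)) / 4
```

Here each occluded pixel is explained by the other warp, so the minimum loss vanishes even though either warp alone would be penalized.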
2.3 Motion Model
We introduce an object motion model $\psi_M$ which shares the same architecture as the ego-motion network $\psi_E$, but is specialized to predicting motions of individual objects in 3D (Figure 2). Similar to the ego-motion model, it takes an RGB image sequence as input, but this time complemented by pre-computed instance segmentation masks. The motion model is then tasked with learning to predict the per-object transformation vectors in 3D space that create the observed object appearance in the respective target frame. Thus, computing warped image frames is no longer a single projection based on ego-motion as in prior work [Zhou et al.2017], but a sequence of projections that are then combined appropriately. The static background is generated by a single warp based on $\psi_E$, whereas all segmented objects are then added by warping their appearance first according to $\psi_E$ and then $\psi_M$. Our approach is conceptually different from prior works which used optical flow for motion in 2D image space [Yin2018] or 3D optical flow [Yang et al.2018a] in that the object motions are explicitly learned in 3D and are available at inference. Our approach not only models objects in 3D but also learns their motion on the fly. This is a principled way of modeling depth independently for the scene and for each individual object.
We define the instance-aligned segmentation masks as $(S_{i,1}, S_{i,2}, S_{i,3})$ for each potential object $i$ in the sequence $(I_1, I_2, I_3)$, and write $S_1, S_2, S_3$ for the masks of the three frames. In order to compute ego-motion, object motions are first masked out of the images. More specifically, we define a binary mask for the static scene, $O_0(S) = 1 - \cup_i S_i$, which removes all image contents corresponding to potentially moving objects, while $O_j(S)$ for $j > 0$ returns a binary mask only for object $j$. The static scene binary mask is applied to all images in the sequence by element-wise multiplication $\odot$ before feeding the sequence to the ego-motion model $\psi_E$:
$$V = O_0(S_1) \odot O_0(S_2) \odot O_0(S_3),$$
$$E_{1 \to 2}, E_{2 \to 3} = \psi_E(I_1 \odot V, I_2 \odot V, I_3 \odot V).$$
To model object motion, we first apply the ego-motion estimate to obtain the warped sequences $(\hat{I}_{1 \to 2}, I_2, \hat{I}_{3 \to 2})$ and $(\hat{S}_{1 \to 2}, S_2, \hat{S}_{3 \to 2})$, where the effect of ego-motion has been removed. Assuming that depth and ego-motion estimates are correct, misalignments within the image sequence are caused only by moving objects. Outlines of potentially moving objects are provided by an off-the-shelf algorithm [He et al.2017] which, similar to prior work that uses optical flow [Yang et al.2018a], is not trained on either of the datasets of interest. For every object instance in the image, the object motion estimate $M^{(i)}$ of the $i$-th object is computed by the object motion model $\psi_M$ as:
$$M^{(i)}_{1 \to 2}, M^{(i)}_{2 \to 3} = \psi_M(\hat{I}_{1 \to 2} \odot O_i(\hat{S}_{1 \to 2}), I_2 \odot O_i(S_2), \hat{I}_{3 \to 2} \odot O_i(\hat{S}_{3 \to 2})).$$
Note that while $M^{(i)}_{1 \to 2}$ and $M^{(i)}_{2 \to 3}$ represent object motions, they in fact model how the camera would have moved in order to explain the object appearance, rather than the object motion directly. The actual 3D motion vectors are obtained by tracking the voxel movements before and after the object movement transform in the respective region. Corresponding to these motion estimates, an inverse warping operation is performed which moves the objects according to the predicted motions. The final warping result is a combination of the individual warpings from moving objects, $\hat{I}^{(i)}$, and the ego-motion warping $\hat{I}$. The full warping $\hat{I}^{(F)}_{1 \to 2}$ is:
$$\hat{I}^{(F)}_{1 \to 2} = \underbrace{\hat{I}_{1 \to 2} \odot V}_{\text{gradient w.r.t. } \psi_E, \phi} + \underbrace{\sum_{i=1}^{N} \hat{I}^{(i)}_{1 \to 2} \odot O_i(S_2)}_{\text{gradient w.r.t. } \psi_M, \phi},$$
and the equivalent for $\hat{I}^{(F)}_{3 \to 2}$. In the above, we denote the gradients for each term, with $\phi$ the image warping operator. Note that the employed masking ensures that no pixel in the final warping result is occupied more than once. While there can be regions which are not filled, these are handled implicitly by the minimum loss computation. Our algorithm automatically learns an individual 3D motion per object, which can be used at inference.
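The masking and composition steps can be sketched on a one-row image. This is an illustrative toy, assuming segmentation maps stored as integer label images (0 for background, positive ids for objects) and a static mask formed as the intersection of the per-frame background masks; the function names and all values are hypothetical, and the warped images are given rather than computed.

```python
def object_mask(seg, obj_id):
    """Binary mask for one object id; obj_id 0 selects the static background."""
    return [[1 if label == obj_id else 0 for label in row] for row in seg]

def static_mask(segs):
    """Pixels that are background in every frame of the sequence."""
    masks = [object_mask(seg, 0) for seg in segs]
    height, width = len(segs[0]), len(segs[0][0])
    return [[min(m[r][c] for m in masks) for c in range(width)]
            for r in range(height)]

def compose_warp(ego_warp, object_warps, seg_target, static):
    """Combine the ego-motion warp (static pixels) with per-object warps."""
    height, width = len(ego_warp), len(ego_warp[0])
    out = [[ego_warp[r][c] * static[r][c] for c in range(width)]
           for r in range(height)]
    for obj_id, warp in object_warps.items():
        mask = object_mask(seg_target, obj_id)
        for r in range(height):
            for c in range(width):
                out[r][c] += warp[r][c] * mask[r][c]
    return out

# Object 1 moves one pixel to the right in each frame of a 1x4 image.
segs = [[[0, 1, 0, 0]], [[0, 0, 1, 0]], [[0, 0, 0, 1]]]
static = static_mask(segs)
ego_warp = [[0.2, 0.2, 0.2, 0.2]]            # warp explained by camera motion
object_warps = {1: [[0.9, 0.9, 0.9, 0.9]]}   # warp explained by object motion
composed = compose_warp(ego_warp, object_warps, segs[1], static)
```

In this toy the two pixels that the object occupies in the other frames stay unfilled (zero), which is the situation the minimum loss computation is meant to absorb.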
|Method||Supervised?||Motion?||Cap||Abs Rel||Sq Rel||RMSE||RMSE log||$\delta < 1.25$||$\delta < 1.25^2$||$\delta < 1.25^3$|
|Train set mean||-||-||80m||0.361||4.826||8.102||0.377||0.638||0.804||0.894|
|Eigen [Eigen, Puhrsch, and Fergus2014] Coarse||GT Depth||-||80m||0.214||1.605||6.563||0.292||0.673||0.884||0.957|
|Eigen [Eigen, Puhrsch, and Fergus2014] Fine||GT Depth||-||80m||0.203||1.548||6.307||0.282||0.702||0.890||0.958|
|Liu [Liu et al.2015]||GT Depth||-||80m||0.201||1.584||6.471||0.273||0.68||0.898||0.967|
|Zhou [Zhou et al.2017]||-||-||80m||0.208||1.768||6.856||0.283||0.678||0.885||0.957|
|Yang [Yang et al.2017]||-||-||80m||0.182||1.481||6.501||0.267||0.725||0.906||0.963|
|Vid2Depth [Mahjourian, Wicke, and Angelova2018]||-||-||80m||0.163||1.240||6.220||0.250||0.762||0.916||0.968|
|LEGO [Yang et al.2018b]||-||M||80m||0.162||1.352||6.276||0.252||0.783||0.921||0.969|
|DDVO [Wang et al.2018]||-||-||80m||0.151||1.257||5.583||0.228||0.810||0.936||0.974|
|Godard [Godard, Aodha, and Brostow2018]||-||-||80m||0.133||1.158||5.370||0.208||0.841||0.949||0.978|
|Yang [Yang et al.2018a]||-||-||80m||0.137||1.326||6.232||0.224||0.806||0.927||0.973|
|Yang [Yang et al.2018a]||-||M||80m||0.131||1.254||6.117||0.220||0.826||0.931||0.973|
2.4 Imposing Object Size Constraints
A common issue pointed out in previous work is that cars moving in front at roughly the same speed often get projected into infinite depth, e.g. [Godard, Aodha, and Brostow2018, Yang et al.2018a]. This is because the object in front shows no apparent motion, and if the network estimates it as being infinitely far away, the reprojection error is reduced almost to zero, which the loss prefers over the correct solution. Previous work has pointed out this significant limitation [Godard, Aodha, and Brostow2018] [Yang et al.2018a] [Wang et al.2018] but offered no solution except for augmenting the training dataset with stereo images. However, stereo is not nearly as widely available as monocular video, which limits the method’s applicability.
Instead, we propose a different way of addressing this problem. The main observation we make is that if the model has no knowledge about object scales, it could explain the same object motion by placing an object very far away and predicting very significant motion, assuming it to be very large, or by placing it very close and predicting little motion, assuming it to be very small. Our key idea is to let the model learn objects’ scales as part of the training process, thus being able to model objects in 3D. Assuming a weak prior on the height of certain objects, e.g. a car, we can get an approximate depth estimate for it given its segmentation mask and the camera intrinsics using $D_{approx}(p; h) \approx f_y \frac{p}{h}$, where $f_y$ is the focal length, $p$ our height prior in world units, and $h$ the height of the respective segmentation blob in pixels. In practice, it is not desirable to estimate such constraints by hand, and the depth prediction scale produced by the network is unknown. Therefore, we let the network learn all constraints simultaneously without requiring additional inputs.
Given the above, we define a loss term on the scale of each object $i$ ($i = 1, \ldots, N$). Let $t(i)$ define a category ID for any object $i$, and $p_j$ be a learnable height prior for each category ID $j$. Let $D$ be a depth map estimate and $S$ the corresponding object outline mask, with $O_i(S)$ the binary mask of object $i$ and $\odot$ element-wise multiplication. Then the loss
$$L_{sc} = \sum_{i=1}^{N} \left\| \frac{D \odot O_i(S)}{\bar{D}} - \frac{D_{approx}(p_{t(i)}; h(O_i(S))) \odot O_i(S)}{\bar{D}} \right\|$$
effectively prevents all segmented objects from degenerating into infinite depth, and forces the network to produce not only a reasonable depth but also matching object motion estimates. We scale by $\bar{D}$, the mean estimated depth of the middle frame, to reduce a potential issue of trivial loss reduction by jointly shrinking priors and the depth prediction range. To our knowledge this is the first method to address common degenerate cases in a fully monocular training setup in 3D. Since this constraint is an integral part of the modeling formulation, the motion models are trained with $L_{sc}$ from the beginning. However, we observed that this additional loss can also successfully correct wrong depth estimates when applied to already trained models, in which case it works by correcting depth for moving objects.
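The height-prior constraint can be sketched for a single object. This is a toy under stated assumptions: the height prior is a fixed number here (in training it would be a learnable per-category parameter), the blob height is taken as the number of image rows touched by the mask, the norm is a sum of absolute differences, and all names and values are hypothetical.

```python
def approx_depth(focal_length, height_prior, pixel_height):
    """Pinhole estimate of depth from a known object height."""
    return focal_length * height_prior / pixel_height

def blob_height(mask):
    """Height in pixels of a binary mask: number of rows containing the object."""
    return sum(1 for row in mask if any(row))

def scale_loss(depth, mask, focal_length, height_prior):
    """L1 gap between predicted and prior depth on the mask, over mean depth."""
    values = [d for row in depth for d in row]
    mean_depth = sum(values) / len(values)
    target = approx_depth(focal_length, height_prior, blob_height(mask))
    gap = sum(abs(d - target) * m
              for depth_row, mask_row in zip(depth, mask)
              for d, m in zip(depth_row, mask_row))
    return gap / mean_depth

mask = [[0, 0], [1, 0], [1, 0]]                           # object spans two rows
plausible = [[20.0, 20.0], [15.0, 20.0], [15.0, 20.0]]
degenerate = [[20.0, 20.0], [1000.0, 20.0], [1000.0, 20.0]]

target_depth = approx_depth(20.0, 1.5, blob_height(mask))
loss_plausible = scale_loss(plausible, mask, focal_length=20.0, height_prior=1.5)
loss_degenerate = scale_loss(degenerate, mask, focal_length=20.0, height_prior=1.5)
```

The degenerate map, which pushes the object far away, receives a clearly larger loss than the plausible one, which is the behavior the constraint is intended to enforce.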
2.5 Test Time Refinement Model
One advantage of having a single-frame depth estimator is its wide applicability. However, this comes at a cost when running continuous depth estimation on image sequences, as consecutive predictions are often misaligned or discontinuous. This is caused by two major issues: 1) scaling inconsistencies between neighboring frames, since both our model and related models have no sense of global scale, and 2) low temporal consistency of depth predictions. In this work we contend that fixing the model weights during inference is not required, and that being able to adapt the model in an online fashion is advantageous, especially for practical autonomous systems. More specifically, we propose to keep training the model while performing inference, addressing these concerns by effectively performing online optimization. In doing so, we also show that even with very limited temporal resolution (i.e., three-frame sequences), we can significantly increase the quality of depth predictions both qualitatively and quantitatively. This low temporal resolution allows our method to still run online in real time, with a typically negligible delay of a single frame. The online refinement is run for $N$ steps (the same value of $N$ is used for all experiments), which effectively fine-tune the model on the fly; $N$ determines a good compromise between exploiting the online tuning sufficiently and preventing over-training, which can cause artifacts. The online refinement approach can be seamlessly applied to any model, including the motion model described above.
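The idea of continuing to train during inference can be illustrated with a deliberately tiny stand-in model. Everything here is a toy assumption: the "model" is a single scalar weight, the self-supervised loss is a mean squared photometric error on intensity pairs taken from the test sequence, and the step count and learning rate are arbitrary illustrative values rather than the settings used in the experiments.

```python
def photometric_grad(weight, pairs):
    """Gradient of the mean squared error for the toy model target = weight * source."""
    return sum(2.0 * (weight * s - t) * s for s, t in pairs) / len(pairs)

def refine_online(weight, pairs, num_steps, learning_rate):
    """Fine-tune the weight on the test sequence itself for a few gradient steps."""
    for _ in range(num_steps):
        weight -= learning_rate * photometric_grad(weight, pairs)
    return weight

pretrained_weight = 1.0                  # stands in for the trained model
test_pairs = [(1.0, 1.2), (2.0, 2.4)]    # (source, target) intensities from the test video
refined_weight = refine_online(pretrained_weight, test_pairs,
                               num_steps=10, learning_rate=0.1)
```

No labels are involved: the update uses only the signal available in the test frames, which is what allows the same procedure to run on data from a new domain. Too many steps on a single short sequence would correspond to the over-training mentioned above.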
3 Experimental Results
Extensive experiments have been conducted on depth estimation, ego-motion estimation and on transfer learning to new environments. We use common metrics and protocols for evaluation adopted by prior methods. Following the same standards as related work, depth measurements that are invalid or unavailable in the ground truth are masked out in the metric computation. We use the following datasets:
KITTI dataset (K). The KITTI dataset [Geiger et al.2013] is the main benchmark for evaluating depth and ego-motion prediction. It has LIDAR sensor readings, used for evaluation only. We use standard splits into training, validation and testing, commonly referred to as the ‘Eigen’ split [Eigen, Puhrsch, and Fergus2014], and evaluate depth predictions up to a fixed range (80 meters).
Cityscapes dataset (C). The Cityscapes dataset [Cordts et al.2016] is another popular and also challenging dataset for autonomous driving. It contains 3250 training and 1250 testing examples which are used in our setup. Of note is that this dataset contains many dynamic scenes with multiple moving objects. We use it for training and for evaluating transfer learning, without fine-tuning.
|Method||Abs Rel||Sq Rel||RMSE||RMSE log||$\delta < 1.25$||$\delta < 1.25^2$||$\delta < 1.25^3$|
|Godard [Godard, Aodha, and Brostow2018]*||0.233||3.533||7.412||0.292||0.700||0.892||0.953|
Fetch Indoor Navigation dataset. This dataset was collected by our Fetch robot [Wise et al.2016] for the purposes of indoor navigation. We test an even more challenging transfer learning scenario: training on an outdoor navigation dataset, Cityscapes, and testing on the indoor one without fine-tuning. The dataset contains images from a single video sequence, recorded at 8fps.
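The evaluation protocol described above (masking invalid ground truth and capping the depth range) can be sketched together with the metrics reported in the tables. The formulas are the commonly used definitions of Abs Rel, Sq Rel, RMSE, RMSE log and the threshold accuracies; the function name, the flat-list input format, and the use of non-positive values to mark missing readings are assumptions for illustration. Steps that some protocols add, such as rescaling predictions to the ground-truth median, are not shown.

```python
import math

def depth_metrics(pred, gt, cap=80.0):
    """Standard depth metrics over valid ground-truth pixels within the cap."""
    # Keep only pixels with a valid ground-truth reading inside the range.
    pairs = [(p, g) for p, g in zip(pred, gt) if 0.0 < g <= cap]
    n = len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n)
    # Fraction of pixels whose ratio to ground truth is below 1.25, 1.25^2, 1.25^3.
    ratios = [max(p / g, g / p) for p, g in pairs]
    deltas = [sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)]
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse,
            "rmse_log": rmse_log, "delta": deltas}

pred = [10.0, 22.0, 5.0, 40.0]
gt = [10.0, 20.0, 0.0, 100.0]   # 0.0 marks a missing reading, 100.0 exceeds the cap
metrics = depth_metrics(pred, gt)
```

Only the first two pixels contribute in this example; the missing reading and the out-of-range reading are excluded before any averaging.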
3.1 Results on the KITTI Dataset
Figure 3 visualizes the results of our method compared to state-of-the-art methods and Table 1 shows quantitative results. Both show a notable improvement over the baseline and over previous methods in the literature. In terms of absolute relative error, our method outperforms competitive models that use motion, [Yang et al.2018a] and [Yin2018]. Furthermore, our results, although monocular, approach those of methods which use stereo or a combination of stereo and monocular, e.g. [Godard, Aodha, and Brostow2017, Kuznietsov, Stuckler, and Leibe2017, Yang et al.2018a, Godard, Aodha, and Brostow2018].
3.1.1 Motion model.
The main contributions of the motion model are that it is able to learn proper depth for moving objects and that it learns better ego-motion. Figure 4 shows several examples of dynamic scenes from the Cityscapes dataset, which contain many moving objects. We note that our baseline, which is by itself a top performer on KITTI, fails on moving objects. Our method makes a notable difference both qualitatively (Figure 4) and quantitatively (see Table 2). Another benefit provided by our motion model is that it learns to predict individual object motions. Figure 6 visualizes the learned motion for individual objects. See the project webpage for a video which demonstrates depth prediction as well as relative speed estimation, which is well aligned with the apparent ego-motion of the video.
[Table rows for Vid2Depth (Mahjourian 2018), Godard (Godard 2018), Zhou (Zhou 2017), and GeoNet (Yin 2018); the values were lost in extraction.]
3.1.2 Refinement model.
We observe improvements from the refinement model on both the KITTI and Cityscapes datasets. Figure 5 shows results of the refinement method alone, compared to the baseline. When evaluating on either KITTI or Cityscapes, the refinement helps recover the geometric structure better. In our results we observe that the refinement model is most helpful when testing across datasets, i.e. in data transfer.
3.2 Experimental Results on the Cityscapes Dataset
In this section we evaluate our method on the Cityscapes dataset, where a lot of object motion is present in the training set. Table 2 shows our experimental results when training on the Cityscapes data and then evaluating on KITTI (without further fine-tuning on KITTI training data). This experiment clearly demonstrates the benefit of our method: the absolute relative error improves significantly, from 0.205 to 0.153, for the proposed approach, which is particularly notable given the state-of-the-art error of 0.233. Both the motion model and the refinement model yield improvements, individually and jointly. We note that the significant improvement of the combined model stems from both the appropriate depth learning of many moving objects (Figure 4), enabled by the motion component, and the refinement component, which actively refines geometry in the scene (Figure 5).
3.3 Visual Odometry Results
Table 3 summarizes our ego-motion results, which are obtained with a standard protocol adopted by prior work [Zhou et al.2017, Godard, Aodha, and Brostow2018] on parts of the KITTI odometry dataset. The total driving sequence lengths tested are 1,702 meters and 918 meters, respectively. Our algorithm performs best among the state-of-the-art methods, even compared to ones that use more temporal information or to established methods such as ORB-SLAM. Proper handling of motion is the biggest contributor to improving our ego-motion estimation.
3.4 Experiments on Fetch Indoor Navigation Dataset
Finally, we verify the approach in an indoor environment setting, by testing on data collected by the Fetch robot [Wise et al.2016]. This is a particularly challenging transfer learning scenario, as training is done on Cityscapes (outdoors) and testing is done on a dataset collected indoors by a different robot platform, which represents a significant domain shift. Figure 7 visualizes the results on the Fetch data. Our algorithm produces better and more realistic depth estimates, notably improves on the baseline method, and successfully adapts to new environments. Notably, the algorithm captures large transparent glass doors, windows and reflective surfaces well. We observe that transfer works best if the amount of motion in between frames is somewhat similar. In addition, camera motion should be present, so that additional information is available and the refinement does not degenerate. Thus, in a static state, online refinement should not be applied.
The code is implemented in TensorFlow and publicly available. The input images are resized to a fixed resolution (with center cropping for Cityscapes). The experiments are run with fixed values for the learning rate, L1 reconstruction weight, SSIM weight, smoothing weight, object-motion constraint weight (although a different value seems to work better for KITTI), batch size, and L2 weight regularization. We perform on-the-fly augmentation by horizontal flipping during testing.
4 Conclusions and Future Work
The method presented in this paper addresses the monocular depth and ego-motion problem by modeling individual objects’ motion in 3D. We also propose an online refinement technique which adapts learning on the fly and can transfer to new datasets or environments. The algorithm achieves new state-of-the-art performance on well established benchmarks, and produces higher quality results for dynamic scenes. In the future, we plan to apply the refinement method over longer sequences so as to incorporate more temporal information. Future work will also focus on full 3D scene reconstruction which is enabled by the proposed depth and ego-motion estimation methods.
Acknowledgements. We would like to thank Ayzaan Wahid for helping us with data collection.
- [Bloesch et al.2018] Bloesch, M.; Czarnowski, J.; Clark, R.; Leutenegger, S.; and Davison, A. J. 2018. Codeslam - learning a compact, optimisable representation for dense visual slam. ArXiv: https://arxiv.org/abs/1804.00874.
- [Cordts et al.2016] Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; and Schiele, B. 2016. The cityscapes dataset for semantic urban scene understanding.
- [Eigen, Puhrsch, and Fergus2014] Eigen, D.; Puhrsch, C.; and Fergus, R. 2014. Depth map prediction from a single image using a multi-scale deep network. NIPS.
- [Garg, Carneiro, and Reid2016] Garg, R.; Carneiro, G.; and Reid, I. 2016. Unsupervised cnn for single view depth estimation: Geometry to the rescue. ECCV.
- [Geiger et al.2013] Geiger, A.; Lenz, P.; Stiller, C.; and Urtasun, R. 2013. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research 32(11):1231–1237.
- [Godard, Aodha, and Brostow2017] Godard, C.; Aodha, O. M.; and Brostow, G. J. 2017. Unsupervised monocular depth estimation with left-right consistency. CVPR.
- [Godard, Aodha, and Brostow2018] Godard, C.; Aodha, O. M.; and Brostow, G. 2018. Digging into self-supervised monocular depth estimation. arxiv.org/pdf/1806.01260.
- [He et al.2017] He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2980–2988. IEEE.
- [Karsch, Liu, and Kang2014a] Karsch, K.; Liu, C.; and Kang, S. 2014a. Depth extraction from video using nonparametric sampling. IEEE transactions on pattern analysis and machine intelligence 36(11).
- [Karsch, Liu, and Kang2014b] Karsch, K.; Liu, C.; and Kang, S. 2014b. Depth transfer: Depth extraction from video using nonparametric sampling. IEEE Transactions on pattern analysis and machine intelligence.
- [Kuznietsov, Stuckler, and Leibe2017] Kuznietsov, Y.; Stuckler, J.; and Leibe, B. 2017. Semi-supervised deep learning for monocular depth map prediction. CVPR.
- [Ladicky, Zeisl, and Pollefeys2014] Ladicky, L.; Zeisl, B.; and Pollefeys, M. 2014. Discriminatively trained dense surface normal estimation. ECCV.
- [Laina et al.2016] Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; and Navab, N. 2016. Deeper depth prediction with fully convolutional residual networks. arXiv:1606.00373.
- [Li, Klein, and Yao2017] Li, J.; Klein, R.; and Yao, A. 2017. A two-streamed network for estimating fine-scaled depth maps from single rgb images. ICCV.
- [Liu et al.2015] Liu, F.; Shen, C.; Lin, G.; and Reid, I. 2015. Learning depth from single monocular images using deep convolutional neural fields. PAMI.
- [Mahjourian, Wicke, and Angelova2018] Mahjourian, R.; Wicke, M.; and Angelova, A. 2018. Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5667–5675.
- [Ummenhofer et al.2017] Ummenhofer, B.; Zhou, H.; Uhrig, J.; Mayer, N.; Ilg, E.; Dosovitskiy, A.; and Brox, T. 2017. Demon: Depth and motion network for learning monocular stereo. CVPR.
- [Wang et al.2004] Wang, Z.; Bovik, A. C.; Sheikh, H. R.; and Simoncelli, E. P. 2004. Image quality assessment: from error visibility to structural similarity. Transactions on Image Processing.
- [Wang et al.2018] Wang, C.; Buenaposada, J. M.; Zhu, R.; and Lucey, S. 2018. Learning depth from monocular videos using direct methods. CVPR.
- [Wang, Fouhey, and Gupta2015] Wang, X.; Fouhey, D.; and Gupta, A. 2015. Designing deep networks for surface normal estimation. CVPR.
- [Wise et al.2016] Wise, M.; Ferguson, M.; King, D.; Diehr, E.; and Dymesich, D. 2016. Fetch and freight: Standard platforms for service robot applications. In Workshop on Autonomous Mobile Service Robots.
- [Yang et al.2017] Yang, Z.; Wang, P.; Xu, W.; Zhao, L.; and Nevatia, R. 2017. Unsupervised learning of geometry with edge-aware depth-normal consistency. arXiv:1711.03665.
- [Yang et al.2018a] Yang, Z.; Wang, P.; Wang, Y.; Xu, W.; and Nevatia, R. 2018a. Every pixel counts: Unsupervised geometry learning with holistic 3d motion understanding. arxiv.org/pdf/1806.10556.
- [Yang et al.2018b] Yang, Z.; Wang, P.; Wang, Y.; Xu, W.; and Nevatia, R. 2018b. Lego: Learning edge with geometry all at once by watching videos. CVPR.
- [Yin2018] Yin, Z., and Shi, J. 2018. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. CVPR.
- [Zhan et al.2018] Zhan, H.; Garg, R.; Weerasekera, C.; Li, K.; Agarwal, H.; and Reid, I. 2018. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. CVPR.
- [Zhou et al.2017] Zhou, T.; Brown, M.; Snavely, N.; and Lowe, D. 2017. Unsupervised learning of depth and ego-motion from video. CVPR.