Depth sensing plays an important role in 3D scene understanding. For instance, robots must know how far away surrounding objects are in order to keep clear of obstacles and adjust their future behaviour. Recently, learning-based single-image depth estimation has attracted a lot of attention due to the rapid progress of Convolutional Neural Networks (CNNs). Supervised methods [Eigen et al.2014, Li et al.2015, Liu et al.2016] aim at learning a mapping from a color image to per-pixel depth with neural networks. However, these methods require a large quantity of color-depth pairs, and collecting such datasets is challenging, especially in outdoor scenarios. Unsupervised learning removes the dependence on ground truth depth and is a promising direction. The key idea of unsupervised learning is to use a warping-based image reconstruction loss between adjacent frames to guide the learning process. Several methods have been proposed that use stereo images to estimate depth [Garg et al.2016, Godard et al.2017]. Although no ground truth depth is required, stereo images are still not as common as monocular videos, and they need to be carefully synchronized.
In this paper, we consider unsupervised depth estimation from monocular videos. In practice, videos are ubiquitous and virtually unlimited. Many previous works in this line are based on the static scene assumption [Mahjourian et al.2018, Wang et al.2018, Zhou et al.2017], where only camera motion is considered, leading to inaccurate results for moving objects. Several works have tried to explicitly model object motion, either with optical flow [Yin and Shi2018] or with SE(3) transforms [Casser et al.2019] that assume rigid object motion. However, Yin and Shi [Yin and Shi2018] do not report an obvious improvement even with the residual flow learning scheme, and Casser et al. [Casser et al.2019] only model the motion of rigid objects such as cars in driving scenes. In real-world scenarios, deformable objects such as pedestrians and animals are often present.
We observe that the coupling of camera motion and individual object motion often causes ambiguities and may confuse the learning process in dynamic scenes. We therefore propose to disentangle camera motion from individual object motion between adjacent frames by introducing an additional transformation for every independently moving object that deforms it to the adjacent frame, as illustrated in Fig. 1. The transformation is modeled as a bicubic deformation whose parameters are learned by a CNN without any predefined correspondences, guided solely by image appearance dissimilarity (see Sec. 3.1 for details). To realize the individual transformations, an existing instance segmentation method, Mask R-CNN [He et al.2017], is used to segment objects in each frame. Segmentation is needed only at training time, where it helps the motion representation to learn a better depth network.
The paper makes the following contributions:
- We present a learning-based approach to estimate depth from unconstrained monocular videos. The approach consists of DepthNet, PoseNet and Region Deformer Networks (RDN), and does not need ground truth supervision.
- We propose a deformation-based motion representation to model non-rigid motion of individual objects between adjacent frames on 2D images. This representation is general and makes our method applicable to unconstrained monocular videos.
- We conduct extensive experiments on three datasets across diverse scenes. Our method not only achieves state-of-the-art performance on the standard benchmarks KITTI and Cityscapes, but also shows promising results on a crowded pedestrian tracking dataset, which validates the effectiveness of the proposed deformation-based motion representation.
2 Related Works
This section briefly reviews some learning based depth estimation work that is most related to ours.
Supervised Depth Estimation. Many supervised methods have been developed to estimate depth [Eigen et al.2014, Li et al.2015, Liu et al.2016, Xie et al.2016]. These methods use CNNs to learn a mapping from RGB images to depth maps. However, they need a training dataset with ground truth depth of real-world scenes, which is hard to acquire, especially in outdoor scenarios, and this limits their applicability. Several works try to resolve this limitation by using synthetic data [Mayer et al.2018, Zheng et al.2018] or images from the Internet [Chen et al.2016, Li and Snavely2018], but special care must be taken to generate high-quality training data, which can be very time-consuming.
Unsupervised Depth Estimation. Unsupervised approaches use an image reconstruction loss between adjacent frames to provide self-supervision. Garg et al. [Garg et al.2016] propose to use calibrated stereo pairs as supervision to train a single-view depth CNN. Godard et al. [Godard et al.2017] further improve the performance by imposing left-right consistency constraints. Zhou et al. [Zhou et al.2017] propose to learn depth and ego-motion from monocular videos under the static scene assumption, with an additional learned explainability mask to ignore the motion of objects. Yin and Shi [Yin and Shi2018] propose to learn a residual flow to handle the motion of objects. Zou et al. [Zou et al.2018] jointly learn depth and optical flow from monocular videos with a cross-task consistency loss in rigid scene regions, but the depth of moving objects does not benefit from the learned optical flow, since depth and flow consistency is enforced only in static regions.
Our work focuses on depth estimation from monocular videos, as videos are more easily available than rectified stereo pairs. The work most similar to ours is [Casser et al.2019]. However, there are two important differences: (i) Casser et al. [Casser et al.2019] model object motion in 3D with SE(3) transformations, which suits rigidly moving objects such as cars in driving scenes. We use a deformation-based representation to model object motion in the 2D image plane, which is more general and applicable to diverse real-world scenarios. (ii) To handle the common issue that cars moving in front of the camera at roughly the same speed are often projected to infinite depth in the monocular setting, Casser et al. [Casser et al.2019] propose to impose object size constraints that depend on the height of the object segmentation mask. This is not suitable for deformable objects, whose actual scale can vary over time. Also, the constraints in [Casser et al.2019] are learned by the network, which can make it tricky to find good hyper-parameters. Instead, we choose a simple yet efficient prior inspired by [Ranftl et al.2016], which is more general for diverse scenes and has no parameters to learn.
Learning Geometric Transformation. Spatial Transformer Networks (STN) [Jaderberg et al.2015] build the first learnable module in the network architecture to handle geometric variation of input data, which is realized by learning a global parametric transformation. Deformable ConvNets [Dai et al.2017] further extend STN by learning offsets to the regular grid sampling locations of the standard convolution. STN and Deformable ConvNets both aim at designing network architectures with geometric invariance for supervised tasks like classification and segmentation. Our deformation-based motion representation instead aims at learning a transformation for each individual object to model object motion between adjacent frames. WarpNet [Kanazawa et al.2016] shares a similar spirit in matching images by learning a transformation, but training WarpNet needs the supervision of artificial correspondences. Our approach is fully unsupervised and is set in the context of depth estimation from videos.
3 Proposed Method
This section presents our generic framework for unsupervised depth estimation from unconstrained monocular videos. The input is a sequence of video frames $\{I_t\}_{t=1}^{N}$, $I_t \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$ and $C$ represent frame $I_t$'s height, width and number of channels, respectively. Our goal is to estimate the corresponding depth maps $\{D_t\}_{t=1}^{N}$. For this purpose, we build a DepthNet to learn the mapping from a color image to a per-pixel depth map, a PoseNet to learn the mapping from two adjacent frames to their relative camera pose transformation, and multiple RDNs in parallel to learn the transformations that model the motion of individual objects from one frame to its adjacent frame. The overall framework is illustrated in Fig. 2. For two adjacent frames $I_t$ and $I_{t+1}$, we first obtain instance segmentation masks from the existing Mask R-CNN model. $I_t$ is fed into the DepthNet to predict its depth $D_t$. The concatenation of $I_t$ and $I_{t+1}$ is fed into the PoseNet to learn the camera pose transformation $T_{t \to t+1}$ between $I_t$ and $I_{t+1}$, where objects are masked out to avoid motion cues from possibly moving objects. We further use the RDNs to model the motion of individual objects in parallel. With $D_t$ and $T_{t \to t+1}$, we reconstruct a synthetic frame $\hat{I}_{t+1 \to t}$ corresponding to $I_t$. The appearance dissimilarity between $\hat{I}_{t+1 \to t}$ and $I_t$ provides the training signal of our framework. During testing, only the DepthNet is needed to predict the depth of an input frame.
Below we first give the basic formulation of the loss function, and then describe the deformation based motion representation that explicitly handles object motion on 2D images.
3.1 Basic Formulation
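With $D_t$ and $T_{t \to t+1}$, we synthesize the image
$$\hat{I}_{t+1 \to t} = \mathcal{W}(I_{t+1}, D_t, T_{t \to t+1}) \qquad (1)$$
corresponding to frame $I_t$, where $\mathcal{W}$ is a warping function. We first construct $\mathcal{W}$ based on the static scene assumption, and then improve it for dynamic scenes by adding individual object motions in Sec. 3.2.
For each pixel coordinate $p$ in frame $I_t$, we can obtain its projected coordinate $\hat{p}$ in frame $I_{t+1}$ with the estimated depth $D_t$ and camera transformation $T_{t \to t+1}$:
$$h(\hat{p}) \sim K \, T_{t \to t+1} \, D_t(p) \, K^{-1} h(p) \qquad (2)$$
where $h(p)$ denotes the homogeneous coordinates of $p$ and $K$ denotes the camera intrinsic matrix. We use the bilinear sampling mechanism of [Jaderberg et al.2015] to sample frame $I_{t+1}$ to create the image $\hat{I}_{t+1 \to t}$.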
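As a rough illustration of the projection step described above, the following NumPy sketch back-projects a pixel using its depth, applies a rigid camera transform, and re-projects the point with the same intrinsics. The function name `project_pixel`, the 4x4 pose layout and the example intrinsics are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def project_pixel(p, depth, K, T):
    """Project pixel p = (x, y) of frame t into frame t+1.

    depth: estimated depth at p.
    K: 3x3 camera intrinsic matrix.
    T: 4x4 rigid camera transform from frame t to t+1 (assumed layout).
    """
    h = np.array([p[0], p[1], 1.0])        # homogeneous pixel coordinates
    X = depth * (np.linalg.inv(K) @ h)     # back-project to a 3D point in camera t
    X_next = T[:3, :3] @ X + T[:3, 3]      # express the point in camera t+1
    q = K @ X_next                         # project with the same intrinsics
    return q[:2] / q[2]                    # dehomogenize

# Hypothetical intrinsics and poses, chosen only to make the arithmetic easy.
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 48.0],
              [0.0, 0.0, 1.0]])
T_identity = np.eye(4)
T_shift = np.eye(4)
T_shift[0, 3] = 0.5   # the point shifts 0.5 units along x in the new camera frame

p_same = project_pixel((70.0, 50.0), 10.0, K, T_identity)
p_moved = project_pixel((70.0, 50.0), 10.0, K, T_shift)
```

With the identity transform the pixel maps to itself, while the 0.5-unit shift at depth 10 moves it by 100 * 0.5 / 10 = 5 pixels along x. The same shift at a larger depth would move it less, which is why depth and camera motion are coupled in this step.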
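The appearance dissimilarity between the reconstructed image $\hat{I}_{t+1 \to t}$ and $I_t$ is defined as
$$\ell = \rho(I_t, \hat{I}_{t+1 \to t})$$
where the function $\rho$ is a combination of photometric error and Structure Similarity (SSIM) [Wang et al.2004]:
$$\rho(I, \hat{I}) = \lambda \, \frac{1 - \mathrm{SSIM}(I, \hat{I})}{2} + (1 - \lambda) \, \lVert I - \hat{I} \rVert_1$$
To handle occlusion/disocclusion between adjacent frames, the per-pixel minimum of the dissimilarities with the previous frame and the next frame is used, as proposed in [Godard et al.2018]:
$$L_{ap} = \min\big(\rho(I_t, \hat{I}_{t+1 \to t}), \; \rho(I_t, \hat{I}_{t-1 \to t})\big)$$
where $\hat{I}_{t-1 \to t}$ is the reconstructed image from $I_{t-1}$.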
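The effect of taking the per-pixel minimum over the two neighbouring frames can be seen in a small NumPy sketch. For brevity it uses only an L1 photometric error and omits the SSIM term, which is a simplification of the dissimilarity described above; the function names are hypothetical.

```python
import numpy as np

def photometric_error(target, recon):
    """Per-pixel L1 error averaged over channels; inputs are HxWxC arrays."""
    return np.abs(target - recon).mean(axis=-1)

def min_reprojection_loss(target, recon_from_prev, recon_from_next):
    """Mean over pixels of the per-pixel minimum of the two error maps."""
    err_prev = photometric_error(target, recon_from_prev)
    err_next = photometric_error(target, recon_from_next)
    return np.minimum(err_prev, err_next).mean()

target = np.zeros((2, 2, 3))
recon_prev = np.zeros((2, 2, 3))
recon_next = np.zeros((2, 2, 3))
recon_prev[0, 0] = 1.0   # pixel (0, 0) is poorly reconstructed from the previous frame
recon_next[1, 1] = 1.0   # pixel (1, 1) is poorly reconstructed from the next frame

loss_min = min_reprojection_loss(target, recon_prev, recon_next)
loss_avg = 0.5 * (photometric_error(target, recon_prev).mean()
                  + photometric_error(target, recon_next).mean())
```

In this toy case each pixel is well explained by at least one neighbouring frame, so the minimum-based loss is zero, whereas simply averaging the two error maps would still penalize the pixels that are occluded in only one of the frames.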
3.2 Region Deformer Networks
For dynamic scenes, individual objects may have their own motions besides camera motion. We propose to explicitly model individual object motion on 2D images. Specifically, our goal is to learn the displacement vector $\Delta p = (\Delta x, \Delta y)$ of every pixel $p = (x, y)$ belonging to moving objects. Then the corresponding pixel of $p$ can be found by $p + \Delta p$. This process is accomplished by learning a function $f$ for each object to map $p$ to the displacement $\Delta p$. The basic requirement for $f$ is that it should be flexible enough to model non-rigid object motion. We choose the bicubic function, which is expressed as
$$\Delta x = \sum_{i=0}^{3} \sum_{j=0}^{3} a_{ij} \, x^i y^j, \qquad \Delta y = \sum_{i=0}^{3} \sum_{j=0}^{3} b_{ij} \, x^i y^j \qquad (8)$$
The bicubic function is widely used in various applications. It has a low computational cost, and its 32 coefficients provide sufficient degrees of freedom for modeling freeform deformation (or motion).
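To make the bicubic map concrete, here is a minimal NumPy sketch that evaluates a bicubic polynomial displacement with 16 coefficients per component (32 in total). The function name, the 4x4 coefficient layout and the use of raw pixel coordinates are assumptions for illustration; the source does not specify how coordinates are normalized.

```python
import numpy as np

def bicubic_displacement(x, y, a, b):
    """Evaluate dx = sum_ij a[i, j] x^i y^j and dy = sum_ij b[i, j] x^i y^j.

    x, y: arrays of pixel coordinates with matching shapes.
    a, b: 4x4 coefficient arrays (16 + 16 = 32 coefficients).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xp = np.stack([x**i for i in range(4)], axis=-1)   # (..., 4) powers of x
    yp = np.stack([y**j for j in range(4)], axis=-1)   # (..., 4) powers of y
    dx = np.einsum('...i,ij,...j->...', xp, a, yp)
    dy = np.einsum('...i,ij,...j->...', xp, b, yp)
    return dx, dy

a = np.zeros((4, 4))
a[0, 0] = 1.0    # constant shift of one pixel in x
a[2, 1] = 2.0    # non-rigid term 2 * x^2 * y
b = np.zeros((4, 4))
b[1, 0] = 0.1    # dy grows linearly with x

xs = np.array([0.0, 1.0, 2.0])
ys = np.array([0.0, 0.0, 1.0])
dx, dy = bicubic_displacement(xs, ys, a, b)
```

Setting only the constant coefficients gives a pure translation, and the linear coefficients give an affine map, so simple rigid-like image motions are special cases of the same parameterization.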
To learn the transformation parameters $a_{ij}$ and $b_{ij}$, we design the Region Deformer Networks (RDN). The workflow of the RDN is illustrated in Fig. 1. Given two adjacent frames $I_t$ and $I_{t+1}$, an instance segmentation model [He et al.2017] is used to segment objects within each frame. Let $M_t^i$ and $M_{t+1}^i$ denote the $i$-th binary object segmentation masks in frames $I_t$ and $I_{t+1}$, respectively. We first compute the reconstructed image $\hat{I}_{t+1 \to t}$ and mask $\hat{M}_{t+1 \to t}^i$ by camera motion using Eq. 1, which eliminates camera motion between adjacent frames. The input of the RDN is the concatenation of the objects $O_t^i = I_t \odot M_t^i$ and $\hat{O}_{t+1 \to t}^i = \hat{I}_{t+1 \to t} \odot \hat{M}_{t+1 \to t}^i$ in $I_t$ and $\hat{I}_{t+1 \to t}$, respectively, where $\odot$ denotes element-wise multiplication, making only the $i$-th object visible to each independent RDN. When multiple objects exist in a single frame, multiple RDNs are used in parallel to model every independent object motion. The outputs of the RDN are the parameters of the transformation. By applying the transformation to its corresponding object, we obtain the displacement of the pixels belonging to that object.
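The masking and concatenation that form the RDN input can be sketched as follows. This is an illustrative NumPy version under the assumption that frames are HxWxC arrays and masks are HxW binary arrays; the function name `rdn_input` is hypothetical.

```python
import numpy as np

def rdn_input(frame_t, warped_next, mask_t, warped_mask_next):
    """Build the input for one object: both masked objects stacked along channels.

    frame_t, warped_next: HxWxC images (the second already warped by camera motion).
    mask_t, warped_mask_next: HxW binary masks of the same object.
    """
    obj_t = frame_t * mask_t[..., None]                   # keep only the object in frame t
    obj_next = warped_next * warped_mask_next[..., None]  # same object in the warped frame
    return np.concatenate([obj_t, obj_next], axis=-1)     # HxWx(2C)

frame_t = np.ones((2, 2, 3))
warped_next = 2.0 * np.ones((2, 2, 3))
mask_t = np.zeros((2, 2))
mask_t[0, 0] = 1.0
warped_mask_next = np.zeros((2, 2))
warped_mask_next[0, 1] = 1.0

x = rdn_input(frame_t, warped_next, mask_t, warped_mask_next)
```

Everything outside the two masks is zeroed, so the network only sees the appearance and position of a single object before and after its own motion.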
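Now we are ready to refine the function $\mathcal{W}$ by defining a new warping function $\mathcal{W}^{*}$. If a pixel $p$ belongs to the static background, we compute its correspondence by Eq. 2. If $p$ belongs to moving objects, its correspondence is found by $\hat{p}^{*} = \hat{p} + \Delta p$, where $\hat{p}$ is from camera motion using Eq. 2 and $\Delta p$ is from object motion obtained by the RDN. In general, for every pixel $p$ in frame $I_t$, we can get its correspondence by
$$\hat{p}^{*} = \hat{p} + \sum_{i=1}^{S} M_t^i(p) \, \Delta p^i$$
where $M_t^i$ is the binary instance segmentation mask for $I_t$: $M_t^i(p)$ is 1 if $p$ belongs to the $i$-th moving object and 0 otherwise, and $S$ is the total number of objects in $I_t$. By modeling individual object motion with the RDN, we can get a more accurately reconstructed image
$$\hat{I}^{*}_{t+1 \to t} = \mathcal{W}^{*}(I_{t+1}, D_t, T_{t \to t+1}, f_1, \dots, f_S)$$
where $f_1, \dots, f_S$ are the individual transformations learned by the RDN. Similarly, we can generate the image $\hat{I}^{*}_{t-1 \to t}$.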
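The way camera-induced correspondences and per-object displacements are combined can be sketched in a few lines of NumPy. The array layout (HxWx2 coordinate maps, one mask and one displacement field per object) and the function name are assumptions for illustration only.

```python
import numpy as np

def refine_correspondences(p_cam, masks, displacements):
    """Add masked object displacements to camera-motion correspondences.

    p_cam: HxWx2 coordinates obtained from camera motion only.
    masks: list of HxW binary arrays, one per object.
    displacements: list of HxWx2 arrays, one per object.
    Returns p_cam + sum_i mask_i * displacement_i.
    """
    p_ref = p_cam.astype(float).copy()
    for mask, disp in zip(masks, displacements):
        p_ref += mask[..., None] * disp   # background pixels (mask == 0) are unchanged
    return p_ref

p_cam = np.zeros((2, 2, 2))
mask = np.zeros((2, 2))
mask[0, 0] = 1.0                                   # a single object pixel
disp = np.ones((2, 2, 2)) * np.array([1.0, 2.0])   # displacement field of that object

p_ref = refine_correspondences(p_cam, [mask], [disp])
```

Because the displacement is gated by the mask, static background pixels keep the correspondence given by camera motion alone, while object pixels receive the additional learned offset.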
Figure 3: Qualitative comparisons on the Eigen test split. GT denotes ground truth depth, which is interpolated for visualization purposes; the upper regions are cropped as they are not available. Compared with other algorithms, our method better captures scene structures and moving objects such as cars and people riding bikes.
|Method|Dataset|Abs Rel|Sq Rel|RMSE|RMSE log|δ < 1.25|δ < 1.25^2|δ < 1.25^3|
|Ours (w/o prior)|K|0.1419|1.1635|5.4739|0.2206|0.8217|0.9416|0.9751|
|Ours (w/o prior)|C|0.1973|1.9500|6.6228|0.2695|0.7200|0.9074|0.9630|
3.3 Object Depth Prior
Depth estimation from monocular videos suffers from the issue that objects moving with the camera at roughly the same speed are often projected to infinite depth, because they show very little appearance change, resulting in a low reprojection error [Godard et al.2018]. Casser et al. [Casser et al.2019] propose to impose object size constraints by additionally learning the actual scales of objects. These constraints implicitly assume that object scales are fixed. However, real-world scenes often contain deformable objects, such as pedestrians and animals, to which these constraints do not apply. Furthermore, as the actual scales are also learned during training, it can be tricky to find good hyper-parameters.
We propose to use a simple yet efficient prior inspired by [Ranftl et al.2016]: objects are supported by their surrounding environment, which holds in most real-world scenes. This prior can be used by requiring the depths of moving objects to be smaller than or equal to those of their horizontal neighbors. However, overlapping objects may exist in the real world and might violate this depth prior, so we introduce a soft constraint, which is formulated as
$$L_{prior} = \max(d_{obj} - d_{neigh} - \delta, \; 0) \qquad (11)$$
where $d_{obj}$ is the mean depth of an individual object, $d_{neigh}$ is the mean depth of its horizontal neighbors in a small range, and $\delta$ is a small positive number to handle exceptions violating our depth prior. The main idea here is to use the depth prior to prevent the degenerate cases of infinite depth. Note that if this prior is satisfied, which happens most of the time, Eq. 11 has no effect (i.e., the loss becomes $0$).
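One plausible reading of the soft constraint described above is a hinge on the difference between the mean object depth and the mean neighbor depth, relaxed by a small margin. The NumPy sketch below follows that reading; the function name, the boolean-mask interface and the way the neighbor region is chosen are assumptions, since the source only says that neighbors lie in a small horizontal range.

```python
import numpy as np

def object_depth_prior(depth, obj_mask, neigh_mask, delta):
    """Hinge penalty that is zero when the object is not farther than its neighbors.

    depth: HxW predicted depth map.
    obj_mask, neigh_mask: HxW boolean masks of the object and its horizontal neighbors.
    delta: small positive margin tolerating mild violations.
    """
    d_obj = depth[obj_mask].mean()
    d_neigh = depth[neigh_mask].mean()
    return max(float(d_obj - d_neigh - delta), 0.0)

# One image row: the object occupies columns 2-3, its neighbors columns 1 and 4.
obj_mask = np.array([[False, False, True, True, False, False]])
neigh_mask = np.array([[False, True, False, False, True, False]])

depth_ok = np.array([[5.0, 5.0, 4.0, 4.0, 5.0, 5.0]])     # object in front of neighbors
depth_bad = np.array([[5.0, 5.0, 80.0, 80.0, 5.0, 5.0]])  # object pushed toward infinity

loss_ok = object_depth_prior(depth_ok, obj_mask, neigh_mask, delta=0.01)
loss_bad = object_depth_prior(depth_bad, obj_mask, neigh_mask, delta=0.01)
```

The penalty vanishes whenever the prior holds and grows linearly with the object's mean depth otherwise, so it only acts on the degenerate far-depth predictions.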
By incorporating the RDN and object depth prior, our final loss function can be expressed as
[Table 2: Depth evaluation on moving objects. Columns: Method (Objects), Dataset, Abs Rel, Sq Rel, RMSE, RMSE log.]
We conduct experiments on several datasets across diverse scenes, including not only standard benchmarks KITTI and Cityscapes, but also a publicly available pedestrian tracking dataset. The details of each dataset are given below.
KITTI. The KITTI dataset [Geiger et al.2012] is a popular benchmark for scene understanding in outdoor driving scenarios. Only the monocular video sequences are used for training, and no ground truth depth is needed. Following the pre-processing of [Zhou et al.2017], we randomly split the processed images into training and validation sets (about 40K images for training and 4K images for validation). The performance is evaluated on the Eigen split [Eigen et al.2014] using the standard evaluation protocol.
Cityscapes. The Cityscapes dataset [Cordts et al.2016] is an outdoor driving dataset similar to KITTI, but with more moving objects. We apply the same data pre-processing as [Casser et al.2019], resulting in 38,675 images for training. We use this dataset only for training; the evaluation is done on the Eigen split.
Pedestrian Tracking Dataset. To validate that our motion representation is general enough to model deformable objects, we collect videos from a publicly available pedestrian tracking dataset [Ess et al.2009], which was recorded in a crowded pedestrian zone. This dataset is very challenging, as large human deformations are frequently observed. We use 9,369 images for training.
4.2 Implementation Details
Our method is implemented in TensorFlow. The input images are resized towhen training on KITTI and Cityscapes datasets, and to when training on the pedestrian tracking dataset. The loss weights are set to , respectively. The smoothness weight is set to 0.04 when training on KITTI and the pedestrian tracking dataset, and to 0.008 when training on Cityscapes dataset. The in Eq. 11 is set to 0.01 when training on KITTI and the pedestrian tracking dataset, and to 0.5 when training on Cityscapes dataset due to the different data distributions of the datasets. The batch size is chosen to be 4 and Adam [Kingma and Ba2014] is used to optimize the network with . When training the baseline model, the learning rate is set to . Our motion model is trained with learning rate of and the network weights are initialized from our trained baseline model. We use the same DepthNet and PoseNet architectures as [Casser et al.2019]. Our RDN uses the same architecture as PoseNet except for the last output layer. We will make our code and trained models publicly available to the community.
The KITTI and Cityscapes Datasets. We report our depth estimation results on the standard Eigen test split [Eigen et al.2014] of the KITTI raw dataset in Tab. 1. All methods are evaluated on the KITTI dataset, regardless of whether they are trained on KITTI or Cityscapes. When training on KITTI, we achieve results comparable to those of [Casser et al.2019], a method specially designed for outdoor driving scenes. When training on the more dynamic Cityscapes dataset, our performance is consistently better than that of [Casser et al.2019]. Further evidence can be seen in the qualitative comparisons in Fig. 3. In addition, our motion model is consistently better than our baseline, whether trained on KITTI or Cityscapes (see Tab. 1 and the visual results in Fig. 4).
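For readers unfamiliar with the error and accuracy measures reported in Tab. 1, the sketch below computes them in the form they are commonly defined for depth evaluation. The function name is hypothetical, and the sketch omits any preprocessing that an evaluation pipeline may apply (for example depth capping or scale alignment), which the text here does not specify.

```python
import numpy as np

def depth_metrics(gt, pred):
    """Common depth metrics for positive ground-truth and predicted depths."""
    gt = np.asarray(gt, dtype=float)
    pred = np.asarray(pred, dtype=float)
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean((gt - pred) ** 2 / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    ratio = np.maximum(gt / pred, pred / gt)      # symmetric per-pixel ratio
    acc = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return abs_rel, sq_rel, rmse, rmse_log, acc

# Toy example: two exact predictions and one that is off by a factor of two.
abs_rel, sq_rel, rmse, rmse_log, acc = depth_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 2.0])
```

The first four values are errors (lower is better), while the three threshold accuracies are fractions of pixels whose prediction is within a given ratio of the ground truth (higher is better).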
Ablation Study. To further evaluate our proposed bicubic motion representation, we create a new baseline named ‘Ours (rigid)’, where we replace the bicubic function in Eq. 8 with a rigid motion representation and keep all other settings the same. The results given in Tab. 1 demonstrate the advantage of our bicubic motion representation, which wins on the majority of the metrics.
To evaluate the contribution of the object depth prior, we add another baseline called ‘Ours (w/o prior)’, which disables the prior in our full motion model. As shown in Tab. 1, the performance of Ours (w/o prior) degrades considerably compared with Ours (Motion), which verifies that our simple object prior handles the infinite depth issue explained in Sec. 3.3.
Evaluation on Moving Objects. To further analyze the effect of the proposed algorithm on different types of objects, we classify the objects in the Eigen test split and show the results of different algorithms on different object categories. Tab. 2 shows the quantitative results on cars (Cars), people (People) and all possibly moving objects (All, which includes other objects such as motorcycles besides cars and people). Notable improvements can be observed when comparing our full motion model (Bicubic) with the baseline (Baseline), whether trained on KITTI (K) or Cityscapes (C). In addition, we replace the bicubic function with a rigid motion representation and keep all other settings the same, in order to compare the rigid motion representation directly with our proposed bicubic motion representation. We denote this model as Rigid, which is the same as Ours (rigid) in Tab. 1. Our bicubic motion representation is consistently better than the rigid representation, even on rigid objects (Cars). A possible reason is that the coupling of the rigid representation and depth (see Eq. 2) may suffer from the scale ambiguity issue [Wang et al.2018], which makes learning harder in the unsupervised setting. In contrast, our bicubic motion representation directly models object motion on the 2D image plane, leading to better performance for various objects.
The Pedestrian Tracking Dataset. To illustrate the generality of our proposed bicubic motion representation, we conduct experiments on a crowded pedestrian tracking dataset, which is quite different from KITTI and Cityscapes datasets and particularly challenging due to the presence of many deformable pedestrians. Fig. 5 visualizes the depth prediction results of samples from this dataset. Clear improvements can be seen from our motion model. The promising results show the generality of our proposed bicubic motion representation and indicate that our framework is applicable to unconstrained monocular videos.
We have presented a learning-based approach to estimate depth from unconstrained monocular videos. The approach consists of DepthNet, PoseNet and RDN, for which a deformation-based bicubic motion representation is proposed to model object motions in diverse scenes. The experimental results on several datasets show the promising performance of the proposed approach and validate the effectiveness of the deformation-based motion representation. In the future, we would like to incorporate more domain knowledge, such as non-rigid structure from motion, into the learning process in order to improve depth estimation in dynamic scenes.
Acknowledgements. We would like to thank Dr. Haiyong Jiang for the helpful discussion during this work. The authors are supported by the National Key R&D Program of China (No. 2016YFC0800501), the National Natural Science Foundation of China (No. 61672481), and the Youth Innovation Promotion Association CAS (No. 2018495).
- [Casser et al.2019] V. Casser, S. Pirk, R. Mahjourian, and A. Angelova. Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. In AAAI, 2019.
- [Chen et al.2016] W. Chen, Z. Fu, D. Yang, and J. Deng. Single-image depth perception in the wild. In NIPS, pages 730–738, 2016.
- [Cordts et al.2016] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, pages 3213–3223, 2016.
- [Dai et al.2017] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. CoRR, abs/1703.06211, 2017.
- [Eigen et al.2014] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, pages 2366–2374, 2014.
- [Ess et al.2009] A. Ess, B. Leibe, K. Schindler, and L. Van Gool. Robust multiperson tracking from a mobile platform. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(10):1831–1846, 2009.
- [Garg et al.2016] R. Garg, V. K. BG, G. Carneiro, and I. Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In ECCV, pages 740–756. Springer, 2016.
- [Geiger et al.2012] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, pages 3354–3361. IEEE, 2012.
- [Godard et al.2017] C. Godard, O. Mac Aodha, and G. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, volume 2, page 7, 2017.
- [Godard et al.2018] C. Godard, O. Mac Aodha, and G. Brostow. Digging into self-supervised monocular depth estimation. arXiv preprint arXiv:1806.01260, 2018.
- [He et al.2017] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In ICCV, pages 2980–2988. IEEE, 2017.
- [Jaderberg et al.2015] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS, pages 2017–2025, 2015.
- [Kanazawa et al.2016] A. Kanazawa, D. W. Jacobs, and M. Chandraker. Warpnet: Weakly supervised matching for single-view reconstruction. In CVPR, pages 3253–3261, 2016.
- [Kingma and Ba2014] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [Li and Snavely2018] Z. Li and N. Snavely. Megadepth: Learning single-view depth prediction from internet photos. In CVPR, pages 2041–2050, 2018.
- [Li et al.2015] B. Li, C. Shen, Y. Dai, A. van den Hengel, and M. He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs. In CVPR, pages 1119–1127, 2015.
- [Liu et al.2016] F. Liu, C. Shen, G. Lin, and I. Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans. Pattern Anal. Mach. Intell., 38(10):2024–2039, 2016.
- [Mahjourian et al.2018] R. Mahjourian, M. Wicke, and A. Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. In CVPR, pages 5667–5675, 2018.
- [Mayer et al.2018] N. Mayer, E. Ilg, P. Fischer, C. Hazirbas, D. Cremers, A. Dosovitskiy, and T. Brox. What makes good synthetic training data for learning disparity and optical flow estimation? International Journal of Computer Vision, pages 1–19, 2018.
- [Ranftl et al.2016] R. Ranftl, V. Vineet, Q. Chen, and V. Koltun. Dense monocular depth estimation in complex dynamic scenes. In CVPR, pages 4058–4066, 2016.
- [Wang et al.2004] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
- [Wang et al.2018] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey. Learning depth from monocular videos using direct methods. In CVPR, pages 2022–2030, 2018.
- [Xie et al.2016] J. Xie, R. Girshick, and A. Farhadi. Deep3d: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks. In ECCV, pages 842–857. Springer, 2016.
- [Yin and Shi2018] Z. Yin and J. Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In CVPR, volume 2, 2018.
- [Zheng et al.2018] C. Zheng, T.-J. Cham, and J. Cai. T2net: Synthetic-to-realistic translation for solving single-image depth estimation tasks. In ECCV, pages 767–783, 2018.
- [Zhou et al.2017] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, volume 2, page 7, 2017.
- [Zou et al.2018] Y. Zou, Z. Luo, and J.-B. Huang. Df-net: Unsupervised joint learning of depth and flow using cross-task consistency. In ECCV, pages 38–55. Springer, 2018.