[ECCV 2018] DF-Net: Unsupervised Joint Learning of Depth and Flow using Cross-Task Consistency
We present an unsupervised learning framework for simultaneously training single-view depth prediction and optical flow estimation models using unlabeled video sequences. Existing unsupervised methods often exploit brightness constancy and spatial smoothness priors to train depth or flow models. In this paper, we propose to leverage geometric consistency as additional supervisory signals. Our core idea is that for rigid regions we can use the predicted scene depth and camera motion to synthesize 2D optical flow by backprojecting the induced 3D scene flow. The discrepancy between the rigid flow (from depth prediction and camera motion) and the estimated flow (from optical flow model) allows us to impose a cross-task consistency loss. While all the networks are jointly optimized during training, they can be applied independently at test time. Extensive experiments demonstrate that our depth and flow models compare favorably with state-of-the-art unsupervised methods.
Single-view depth prediction and optical flow estimation are two fundamental problems in computer vision. While the two tasks aim to recover highly correlated information from the scene (i.e., the scene structure and the dense motion field between consecutive frames), existing efforts typically study each problem in isolation. In this paper, we demonstrate the benefits of exploring the geometric relationship between depth, camera motion, and flow for unsupervised learning of depth and flow estimation models.
With the rapid development of deep convolutional neural networks (CNNs), numerous approaches have been proposed to tackle dense prediction problems in an end-to-end manner. However, supervised training of CNNs for such tasks often involves constructing large-scale, diverse datasets with dense pixelwise ground truth labels. Collecting such densely labeled datasets in the real world requires a significant amount of human effort and is prone to error. Existing efforts at RGB-D dataset construction [18, 45, 53, 54] often have limited scope (e.g., in terms of locations, scenes, and objects) and hence lack diversity. For optical flow, dense motion annotations are even more difficult to acquire. Consequently, existing CNN-based methods rely on synthetic datasets for training the models [5, 12, 16, 24]. These synthetic datasets, however, do not capture the complexity of motion blur, occlusion, and natural image statistics from real scenes. The trained models usually do not generalize well to unseen scenes without fine-tuning on sufficient ground truth data in a new visual domain.
Several works [17, 21, 28] have been proposed to capitalize on large-scale real-world videos to train CNNs in the unsupervised setting. The main idea is to exploit the brightness constancy and spatial smoothness assumptions of flow fields or disparity maps as supervisory signals. These assumptions, however, often do not hold at motion boundaries and hence make the training unstable.
Many recent efforts [59, 60, 65, 73] explore the geometric relationship between the two problems. With the estimated depth and camera pose, these methods can produce dense optical flow by backprojecting the 3D scene flow induced from camera ego-motion. However, these methods implicitly assume perfect depth and camera pose estimation when “synthesizing” the optical flow. The errors in either depth or camera pose estimation inevitably produce inaccurate flow predictions.
In this paper, we present a technique for jointly learning a single-view depth estimation model and a flow prediction model using unlabeled videos as shown in Figure 2. Our key observation is that the predictions from depth, pose, and optical flow should be consistent with each other. By exploiting this geometry cue, we present a novel cross-task consistency loss that provides additional supervisory signals for training both networks. We validate the effectiveness of the proposed approach through extensive experiments on several benchmark datasets. Experimental results show that our joint training method significantly improves the performance of both models (Figure 1). The proposed depth and flow models compare favorably with state-of-the-art unsupervised methods.
We make the following contributions. (1) We propose an unsupervised learning framework to simultaneously train a depth prediction network and an optical flow network. We achieve this by introducing a cross-task consistency loss that enforces geometric consistency. (2) We show that through the proposed unsupervised training our depth and flow models compare favorably with existing unsupervised algorithms and achieve competitive performance with supervised methods on several benchmark datasets. (3) We release the source code and pre-trained models to facilitate future research: http://yuliang.vision/DF-Net/
Supervised learning using CNNs has emerged as an effective approach for depth and flow estimation that avoids hand-crafted objective functions and computationally expensive optimization at test time. The availability of RGB-D datasets and deep learning has led to a line of work on single-view depth estimation [13, 14, 35, 38, 62, 72]. While promising results have been shown, these methods rely on absolute ground truth depth maps. These depth maps, however, are expensive and difficult to collect. Some efforts [8, 74] have been made to relax the difficulty of collecting absolute depth by exploring learning from relative/ordinal depth annotations. Recent work also explores gathering training datasets from web videos or Internet photos using structure-from-motion and multi-view stereo algorithms.
Compared to ground truth depth datasets, constructing optical flow datasets of diverse scenes in the real world is even more challenging. Consequently, existing approaches [12, 26, 47] typically rely on synthetic datasets [5, 12] for training. Due to the limited scalability of constructing diverse, high-quality training data, fully supervised approaches often require fine-tuning on sufficient ground truth labels in new visual domains to perform well. In contrast, our approach leverages the readily available real-world videos to jointly train the depth and flow models. The ability to learn from unlabeled data enables unsupervised pre-training for domains with limited amounts of ground truth data.
To alleviate the dependency on large-scale annotated datasets, several works have been proposed to exploit the classical assumptions of brightness constancy and spatial smoothness on the disparity map or the flow field [17, 21, 28, 43, 71]. The core idea is to treat the estimated depth and flow as latent layers and use them to differentiably warp the source frame to the target frame, where the source and target frames can either be the stereo pair or two consecutive frames in a video sequence. A photometric loss between the synthesized frame and the target frame can then serve as an unsupervised proxy loss to train the network. Using photometric loss alone, however, is not sufficient due to the ambiguity on textureless regions and occlusion boundaries. Hence, the network training is often unstable and requires careful hyper-parameter tuning of the loss functions. Our approach builds upon existing unsupervised losses for training our depth and flow networks. We show that the proposed cross-task consistency loss provides a sizable performance boost over individually trained models.
Recently, a number of works have exploited the geometric relationship between depth, camera pose, and flow for learning depth or flow models [60, 65, 68, 73]. These methods first estimate the depth of the input images. Together with the estimated camera poses between two consecutive frames, these methods “synthesize” the flow field of rigid regions. The synthesized flow from depth and pose can either be used as is for flow prediction in rigid regions [60, 65, 68, 48] or used for view synthesis to train a depth model from monocular videos. Additional cues such as surface normals, edges, and physical constraints can be incorporated to further improve the performance.
These approaches exploit the inherent geometric relationship between structure and motion. However, the errors produced by either the depth or the camera pose estimation propagate to flow predictions. Our key insight is that for rigid regions the estimated flow (from flow prediction network) and the synthesized rigid flow (from depth and camera pose networks) should be consistent. Consequently, coupled training allows both depth and flow networks to learn from each other and enforce geometrically consistent predictions of the scene.
Joint estimation of structure and camera pose from multiple images of a given scene is a long-standing problem [46, 15, 64]. Conventional methods can recover (semi-)dense depth estimation and camera pose through keypoint tracking/matching. The outputs of these algorithms can potentially be used to help train a flow network, but not the other way around. Our work differs as we are also interested in learning a depth network to recover dense structure from a single input image.
Simultaneously addressing multiple tasks through multi-task learning has shown advantages over methods that tackle individual tasks. For example, joint learning of video segmentation and optical flow through layered models [6, 56] or feature sharing helps improve accuracy at motion boundaries. Single-view depth model learning can also benefit from joint training with surface normal estimation [35, 67] or semantic segmentation [13, 30].
Our approach tackles the problem of learning both depth and flow models. Unlike existing multi-task learning methods that often require direct supervision using ground truth training data for each task, our approach instead leverages meta-supervision to couple the training of depth and flow models. While our models are jointly trained, they can be applied independently at test time.
Our goal is to develop an unsupervised learning framework for jointly training the single-view depth estimation network and the optical flow prediction network using unlabeled video sequences. Figure 3 shows the high-level sketch of our proposed approach. Given two consecutive frames $(I_t, I_{t+1})$ sampled from an unlabeled video, we first estimate the depth of frames $I_t$ and $I_{t+1}$, and the forward and backward optical flow fields between $I_t$ and $I_{t+1}$. We then estimate the 6D camera pose transformation between the two frames.
With the predicted depth map and the estimated 6D camera pose, we can produce the 3D scene flow induced by camera ego-motion and backproject it onto the image plane to synthesize the 2D flow (Section 3.2). We refer to this synthesized flow as the rigid flow. If the scene is mostly static, the synthesized rigid flow should be consistent with the estimated optical flow (produced by the optical flow prediction model). In practice, however, the predictions from the two branches may not agree. Our intuition is that the discrepancy between the rigid flow and the estimated flow provides additional supervisory signals for both networks. Hence, we propose a cross-task consistency loss to enforce this constraint (Section 3.5). To handle non-rigid transformations that cannot be explained by the camera motion, as well as occluded and dis-occluded regions, we use a forward-backward consistency check to identify valid regions (Section 3.4). We do not enforce the cross-task consistency in regions that fail this check.
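Our overall objective function can be formulated as follows:
$$\mathcal{L} = \mathcal{L}_{\text{photometric}} + \lambda_s \mathcal{L}_{\text{smooth}} + \lambda_f \mathcal{L}_{\text{forward-backward}} + \lambda_c \mathcal{L}_{\text{cross}} \tag{1}$$
where $\lambda_s$, $\lambda_f$, and $\lambda_c$ weight the smoothness, forward-backward consistency, and cross-task consistency terms, respectively.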
All four loss terms are applied to both the depth and flow networks. All four are also symmetric in the forward and backward directions; for simplicity, we derive them only for the forward direction.
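Given the two input frames $I_t$ and $I_{t+1}$, the predicted depth map $\hat{D}_t$, and the relative camera pose $\hat{T}_{t \to t+1}$, we wish to establish the dense pixel correspondence between the two frames. Let $p_t$ denote the 2D homogeneous coordinate of a pixel in frame $I_t$ and $K$ denote the camera intrinsic matrix. We can compute the corresponding point of $p_t$ in frame $I_{t+1}$ using the following equation:
$$p_{t+1} = K \, \hat{T}_{t \to t+1} \, \hat{D}_t(p_t) \, K^{-1} p_t \tag{2}$$
We can then obtain the synthesized forward rigid flow $F_{\text{rigid}}$ at pixel $p_t$ in $I_t$ by
$$F_{\text{rigid}}(p_t) = p_{t+1} - p_t \tag{3}$$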
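The projection and subtraction steps described here can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the released implementation: the function name `rigid_flow`, the 4x4 homogeneous pose matrix `T`, and the `(H, W, 2)` flow layout are all choices made for this example.

```python
import numpy as np

def rigid_flow(depth, K, T):
    """Synthesize forward rigid flow, shape (H, W, 2), from a depth map (H, W),
    intrinsics K (3, 3), and a relative pose T (4, 4) that maps points from the
    camera frame at time t to the camera frame at time t+1 (assumed layout)."""
    h, w = depth.shape
    # Homogeneous pixel grid p_t = (x, y, 1), flattened row-major to (3, H*W).
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float64),
                         np.arange(h, dtype=np.float64))
    p_t = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    # Back-project every pixel to 3D camera coordinates: X = D(p) * K^{-1} p.
    cam = (np.linalg.inv(K) @ p_t) * depth.ravel()
    # Move the points with the rigid camera transform.
    cam_h = np.vstack([cam, np.ones((1, h * w))])
    cam_next = (T @ cam_h)[:3]
    # Project into frame t+1 and dehomogenize (assumes positive depth).
    proj = K @ cam_next
    p_next = proj[:2] / proj[2:3]
    # Rigid flow is the displacement between corresponding pixels.
    return (p_next - p_t[:2]).T.reshape(h, w, 2)
```

With an identity pose the function returns zero flow everywhere. For a pure sideways translation `tx` and constant depth `d`, every pixel moves horizontally by `fx * tx / d`, which is a convenient sanity check.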
Here we briefly review two loss functions that we used in our framework to regularize network training. Leveraging the brightness constancy and spatial smoothness priors used in classical dense correspondence algorithms [4, 23, 40], prior work has used the photometric discrepancy between the warped frame and the target frame as an unsupervised proxy loss function for training CNNs without ground truth annotations.
Suppose that we have frames $I_t$ and $I_{t+1}$, as well as the estimated flow $F_{t \to t+1}$ (either the optical flow predicted by the flow model or the synthesized rigid flow induced by the estimated depth and camera pose). We can then produce the warped frame $\bar{I}_t$ by inverse warping from frame $I_{t+1}$. Because the projected image coordinates may not lie exactly on the pixel grid, we apply the differentiable bilinear interpolation used in spatial transformer networks [27] to perform frame synthesis.
With the warped frame $\bar{I}_t$ from $I_{t+1}$, we formulate the brightness constancy objective function as
$$\mathcal{L}_{\text{photometric}} = \sum_{p} \rho\big(I_t(p), \bar{I}_t(p)\big) \tag{4}$$
where $\rho(\cdot)$ is a function that measures the difference between pixel values. Previous work simply chooses the $L_1$ norm or the appearance matching loss, which is not invariant to illumination changes in real-world scenarios. Here we adopt the ternary census transform based loss [43, 55, 69], which can better handle complex illumination changes.
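The inverse warping step and a photometric comparison can be sketched as follows. This is only an illustration: it works on single-channel NumPy arrays, and it uses a mean absolute difference as a simple stand-in for the pixel difference function, not the ternary census loss adopted in the paper. The function names `inverse_warp` and `photometric_l1` are assumptions of this example.

```python
import numpy as np

def inverse_warp(src, flow):
    """Warp src (H, W) toward the target view: out(p) = src(p + flow(p)),
    sampled with bilinear interpolation. Also returns a mask of pixels whose
    sampling location falls inside the source image."""
    h, w = src.shape
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float64),
                         np.arange(h, dtype=np.float64))
    x = xs + flow[..., 0]
    y = ys + flow[..., 1]
    inside = (x >= 0) & (x <= w - 1) & (y >= 0) & (y <= h - 1)
    # Clamp so that out-of-range samples read the nearest border pixel.
    x = np.clip(x, 0, w - 1)
    y = np.clip(y, 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = x - x0
    wy = y - y0
    out = ((1 - wx) * (1 - wy) * src[y0, x0] + wx * (1 - wy) * src[y0, x1]
           + (1 - wx) * wy * src[y1, x0] + wx * wy * src[y1, x1])
    return out, inside

def photometric_l1(target, warped, mask):
    """Mean absolute difference over masked pixels (stand-in for rho)."""
    return float(np.sum(np.abs(target - warped) * mask) / max(mask.sum(), 1))
```

In a training framework the same sampling would be written with differentiable tensor operations so that gradients reach the flow, depth, and pose predictions.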
The brightness constancy loss is not informative in low-texture or homogeneous regions of the scene. To handle this issue, existing work incorporates a smoothness prior to regularize the estimated disparity map or flow field. We adopt a spatial smoothness loss proposed in prior work.
According to the brightness constancy assumption, the warped frame should be similar to the target frame. However, the assumption does not hold for occluded and dis-occluded regions. We address this problem by using the commonly used forward-backward consistency check technique to identify invalid regions and do not impose the photometric loss on those regions.
We implement the occlusion detection based on the forward-backward consistency assumption (i.e., traversing the flow vector forward and then backward should arrive at the same position). Here we use a simple criterion from prior work. We mark pixels as invalid whenever this constraint is violated. Figure 4 shows two examples of the invalid regions marked by the forward-backward consistency check using the synthesized rigid flow.
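A forward-backward check of this kind can be sketched as below. The exact criterion used by the authors is not recoverable from the text, so this example assumes a commonly used form: the squared round-trip error must stay below a tolerance that grows with the flow magnitude. The tolerances `alpha1` and `alpha2` are hypothetical values, and nearest-neighbour lookup replaces bilinear sampling for brevity.

```python
import numpy as np

def forward_backward_valid(flow_fw, flow_bw, alpha1=0.01, alpha2=0.5):
    """Return a boolean (H, W) mask of pixels that pass the check.
    flow_fw, flow_bw: (H, W, 2) arrays of (u, v) displacements in pixels.
    alpha1 and alpha2 are hypothetical tolerances, not values from the paper."""
    h, w = flow_fw.shape[:2]
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    # Where each pixel lands in the other frame (rounded to the nearest pixel).
    x2 = np.rint(xs + flow_fw[..., 0]).astype(int)
    y2 = np.rint(ys + flow_fw[..., 1]).astype(int)
    inside = (x2 >= 0) & (x2 < w) & (y2 >= 0) & (y2 < h)
    x2c = np.clip(x2, 0, w - 1)
    y2c = np.clip(y2, 0, h - 1)
    bw_at_target = flow_bw[y2c, x2c]
    # Going forward and then backward should return to the start,
    # so the two vectors should cancel.
    diff_sq = np.sum((flow_fw + bw_at_target) ** 2, axis=-1)
    mag_sq = np.sum(flow_fw ** 2, axis=-1) + np.sum(bw_at_target ** 2, axis=-1)
    return inside & (diff_sq < alpha1 * mag_sq + alpha2)
```

Pixels that leave the image are also marked invalid here, since no backward vector exists for them.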
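Denoting the valid region by $V$ (from either the rigid flow or the estimated flow), we modify the photometric loss term (4) as
$$\mathcal{L}_{\text{photometric}} = \sum_{p} V(p) \, \rho\big(I_t(p), \bar{I}_t(p)\big)$$
with the pixel difference function $\rho$ and the warped frame $\bar{I}_t$ as in (4).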
In addition to using forward-backward consistency check for identifying invalid regions, we can further impose constraints on the valid regions so that the network can produce consistent predictions for both forward and backward directions. Similar ideas have been exploited in [25, 43] for occlusion-aware flow estimation. Here, we apply the forward-backward consistency loss to both flow and depth predictions.
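For flow prediction, the forward-backward consistency loss is of the form:
$$\mathcal{L}_{\text{flow-consistency}} = \sum_{p} V(p) \, \big\lVert F_{t \to t+1}(p) + F_{t+1 \to t}\big(p + F_{t \to t+1}(p)\big) \big\rVert$$
Similarly, we impose a consistency penalty for depth:
$$\mathcal{L}_{\text{depth-consistency}} = \sum_{p} V(p) \, \big\lVert \hat{D}_t(p) - \bar{D}_t(p) \big\rVert$$
where $\bar{D}_t$ is warped from $\hat{D}_{t+1}$ using the synthesized rigid flow from $I_t$ to $I_{t+1}$.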
While we exploit robust functions for enforcing photometric and forward-backward consistency for each task, the training of depth and flow networks using unlabeled data remains non-trivial and sensitive to the choice of hyper-parameters. Building upon the existing loss functions, in the following we introduce a novel cross-task consistency loss to further regularize the network training.
In Section 3.2, we show that the motion of rigid regions in the scene can be explained by the ego-motion of the camera and the corresponding scene depth. On the one hand, we can estimate the rigid flow by backprojecting the induced 3D scene flow from the estimated depth and relative camera pose. On the other hand, we have direct estimation results from an optical flow network. Our core idea is that these two flow fields should be consistent with each other for non-occluded and static regions. Minimizing the discrepancy between the two flow fields allows us to simultaneously update the depth and flow models.
We thus propose to minimize the endpoint distance between the flow vectors in the rigid flow (computed from the estimated depth and pose) and those in the estimated flow (computed from the flow prediction model). We denote the synthesized rigid flow as $F_{\text{rigid}}$ and the estimated flow as $F_{\text{flow}}$. Using the computed valid masks $V$ (Section 3.4), we impose the cross-task consistency constraint over valid pixels:
$$\mathcal{L}_{\text{cross}} = \sum_{p} V(p) \, \big\lVert F_{\text{rigid}}(p) - F_{\text{flow}}(p) \big\rVert$$
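As an illustration of the cross-task term, the sketch below computes the endpoint distance between two flow fields and averages it over valid pixels. The Euclidean norm and the averaging (rather than summing) are assumptions of this example, as is the function name `cross_task_loss`.

```python
import numpy as np

def cross_task_loss(flow_rigid, flow_est, valid):
    """Mean endpoint distance between the rigid flow and the estimated flow,
    both (H, W, 2), restricted to pixels where valid (H, W) is True.
    The Euclidean norm and the mean reduction are assumptions."""
    dist = np.linalg.norm(flow_rigid - flow_est, axis=-1)
    n = valid.sum()
    return float((dist * valid).sum() / n) if n > 0 else 0.0
```

During joint training the gradient of such a term flows into both branches: through `flow_est` into the flow network and through `flow_rigid` into the depth and pose networks.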
In this section, we validate the effectiveness of our proposed method for unsupervised learning of depth and flow on several standard benchmark datasets. More results can be found in the supplementary material. Our source code and pre-trained models are available on http://yuliang.vision/DF-Net/.
We use video clips from the train split of the KITTI raw dataset for joint learning of depth and flow models. Note that our training does not involve any depth/flow labels.
To prevent the joint training process from converging to trivial solutions, we pre-train the flow network on the SYNTHIA dataset without supervision. For pre-training both the depth and pose networks, we use either the KITTI raw dataset or the CityScapes dataset.
The SYNTHIA dataset contains multi-view frames captured by driving vehicles in different scenarios and traffic conditions. We take all the four-view images of the left camera from all summer and winter driving sequences, which contain around 37K image pairs. The CityScapes dataset contains real-world driving sequences. We follow Zhou et al. and pre-process the dataset to generate around 75K training image pairs.
For evaluating the performance of our depth network, we use the test split of the KITTI raw dataset. The depth maps for KITTI raw are sampled at irregularly spaced positions, captured using a rotating LIDAR scanner. Following the standard evaluation protocol, we evaluate the performance using only the regions with ground truth depth samples (bottom parts of the images). We also evaluate the generalization of our depth network on general scenes using the Make3D dataset.
We implement our approach in TensorFlow and conduct all the experiments on a single Tesla K80 GPU with 12GB memory. We set , , and . For network training, we use the Adam optimizer  with , . In the following, we provide more implementation details in network architecture, network pre-training, and the proposed unsupervised joint training.
For the pose network, we adopt the architecture from Zhou et al. For the depth network, we use ResNet-50 as our feature backbone with ELU activation functions. For the flow network, we adopt the UnFlow-C structure, a variant of FlowNetC. As our network training is model-agnostic, more advanced network architectures (e.g., for pose, depth, or flow) can be used to further improve the performance.
We train the depth and pose networks with a mini-batch size of 6 image pairs whose size is , from KITTI raw dataset or CityScapes dataset for 100K iterations. We use a learning rate of 2e-4. Each iteration takes around 0.8s (forward and backprop) during training.
Following Meister et al. , we train the flow network with a mini-batch size of 4 image pairs whose size is from SYNTHIA dataset for 300K iterations. We keep the initial learning rate as 1e-4 for the first 100K iterations and then reduce the learning rate by half after each 100K iterations. Each iteration takes around 2.4s (forward and backprop).
We jointly train the depth, pose, and flow networks with a mini-batch size of 4 image pairs from KITTI raw dataset for 100K iterations. Input size for the depth and pose networks is , while the input size for the flow network is . We divide the initial learning rate by 2 for every 20K iterations. Our depth network produces depth predictions at 4 spatial scales, while the flow network produces flow fields at 5 scales. We enforce the cross-network consistency in the finest 4 scales. Each iteration takes around 3.6s (forward and backprop) during training.
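The step-wise learning rate schedules described for flow pre-training and joint training can be written as one small helper. This is a sketch: the function name `step_lr` and its parameters are assumptions of this example, and the default values reproduce the flow pre-training schedule (1e-4 held for 100K iterations, then halved after each 100K).

```python
def step_lr(iteration, base_lr=1e-4, hold=100_000, period=100_000, factor=0.5):
    """Hold base_lr for `hold` iterations, then multiply by `factor` once at
    `hold` and again every `period` iterations after that."""
    if iteration < hold:
        return base_lr
    drops = 1 + (iteration - hold) // period
    return base_lr * factor ** drops
```

Calling the helper with `hold=20_000, period=20_000` gives the halving every 20K iterations described for joint training; the initial rate for that stage has to be passed in as `base_lr`.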
As the input size of the UnFlow-C network must be divisible by 64, we resize input image pairs of the two KITTI flow datasets to using bilinear interpolation. We then resize the estimated optical flow and rescale the predicted flow vectors to match the original input size. For depth estimation, we resize the input image to the training input size and first predict the disparity. We then resize and rescale the predicted disparity to the original size and compute its inverse to obtain the final prediction.
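Resizing a flow field changes the pixel units of its vectors, so the vectors have to be scaled along with the grid. The sketch below illustrates this with a small NumPy bilinear resize. The helper names and the pixel-centre alignment convention are assumptions of this example, not details taken from the paper.

```python
import numpy as np

def resize_bilinear(img, out_h, out_w):
    """Bilinear resize of an (H, W, C) array using pixel-centre alignment."""
    h, w = img.shape[:2]
    y = np.clip((np.arange(out_h) + 0.5) * h / out_h - 0.5, 0, h - 1)
    x = np.clip((np.arange(out_w) + 0.5) * w / out_w - 0.5, 0, w - 1)
    y0 = np.floor(y).astype(int)
    x0 = np.floor(x).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (y - y0)[:, None, None]
    wx = (x - x0)[None, :, None]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def rescale_flow(flow, out_h, out_w):
    """Resize a flow field (H, W, 2) and scale its (u, v) vectors so they are
    expressed in pixels of the output resolution."""
    h, w = flow.shape[:2]
    out = resize_bilinear(flow, out_h, out_w)
    out[..., 0] *= out_w / w   # horizontal displacement scales with width
    out[..., 1] *= out_h / h   # vertical displacement scales with height
    return out
```

For example, a flow predicted on a grid one third as wide as the original image has its horizontal components multiplied by 3 after resizing.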
Table 1. Single-view depth estimation results on the test split of the KITTI raw dataset. Abs Rel, Sq Rel, RMSE, and log RMSE are error metrics (lower is better); the δ columns are accuracy metrics (higher is better).

| Method | Dataset | Abs Rel | Sq Rel | RMSE | log RMSE | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|---|
| Eigen et al. | K (D) | 0.203 | 1.548 | 6.307 | 0.246 | 0.702 | 0.890 | 0.958 |
| Kuznietsov et al. | K (B) / K (D) | 0.113 | 0.741 | 4.621 | 0.189 | 0.862 | 0.960 | 0.986 |
| Zhan et al. | K (B) | 0.144 | 1.391 | 5.869 | 0.241 | 0.803 | 0.928 | 0.969 |
| Godard et al. | K (B) | 0.133 | 1.140 | 5.527 | 0.229 | 0.830 | 0.936 | 0.970 |
| Godard et al. | CS+K (B) | 0.121 | 1.032 | 5.200 | 0.215 | 0.854 | 0.944 | 0.973 |
| Zhou et al. | K (M) | 0.208 | 1.768 | 6.856 | 0.283 | 0.678 | 0.885 | 0.957 |
| Yang et al. | K (M) | 0.182 | 1.481 | 6.501 | 0.267 | 0.725 | 0.906 | 0.963 |
| Mahjourian et al. | K (M) | 0.163 | 1.240 | 6.220 | 0.250 | 0.762 | 0.916 | 0.968 |
| Yang et al. | K (M) | 0.162 | 1.352 | 6.276 | 0.252 | - | - | - |
| Yin et al. | K (M) | 0.155 | 1.296 | 5.857 | 0.233 | 0.793 | 0.931 | 0.973 |
| Godard et al. | K (M) | 0.154 | 1.218 | 5.699 | 0.231 | 0.798 | 0.932 | 0.973 |
| Ours (w/o forward-backward) | K (M) | 0.160 | 1.256 | 5.555 | 0.226 | 0.796 | 0.931 | 0.973 |
| Ours (w/o cross-task) | K (M) | 0.160 | 1.234 | 5.508 | 0.225 | 0.800 | 0.932 | 0.972 |
| Zhou et al. | CS+K (M) | 0.198 | 1.836 | 6.565 | 0.275 | 0.718 | 0.901 | 0.960 |
| Yang et al. | CS+K (M) | 0.165 | 1.360 | 6.641 | 0.248 | 0.750 | 0.914 | 0.969 |
| Mahjourian et al. | CS+K (M) | 0.159 | 1.231 | 5.912 | 0.243 | 0.784 | 0.923 | 0.970 |
| Yang et al. | CS+K (M) | 0.159 | 1.345 | 6.254 | 0.247 | - | - | - |
| Yin et al. | CS+K (M) | 0.153 | 1.328 | 5.737 | 0.232 | 0.802 | 0.934 | 0.972 |
| Ours (w/o forward-backward) | CS+K (M) | 0.159 | 1.716 | 5.616 | 0.222 | 0.805 | 0.939 | 0.976 |
| Ours (w/o cross-task) | CS+K (M) | 0.155 | 1.181 | 5.301 | 0.218 | 0.805 | 0.939 | 0.977 |
Following Zhou et al., we evaluate our depth network using several error metrics (absolute relative difference, squared relative difference, RMSE, and log RMSE). For optical flow estimation, we compute the average endpoint error (EPE) on pixels with ground truth flow available for each dataset. On the KITTI flow 2015 dataset, we also compute the F1 score, which is the percentage of pixels whose EPE is greater than both 3 pixels and 5% of the ground truth value.
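The error metrics named here can be computed as in the sketch below. It assumes that missing ground truth depth is stored as zero and that predicted depths are positive; the function names are chosen for this example only.

```python
import numpy as np

def depth_errors(gt, pred):
    """Abs Rel, Sq Rel, RMSE, and log RMSE over pixels with ground truth.
    Assumes missing ground truth is encoded as 0 and pred is positive."""
    m = gt > 0
    gt, pred = gt[m], pred[m]
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean((gt - pred) ** 2 / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    return abs_rel, sq_rel, rmse, rmse_log

def flow_errors(gt, pred, valid):
    """Average endpoint error and F1 outlier ratio over valid pixels.
    gt, pred: (H, W, 2) flow fields; valid: (H, W) boolean mask."""
    epe = np.linalg.norm(gt - pred, axis=-1)
    mag = np.linalg.norm(gt, axis=-1)
    # A pixel is an outlier if its error exceeds both 3 px and 5% of |gt|.
    outlier = (epe > 3.0) & (epe > 0.05 * mag)
    return float(epe[valid].mean()), float(outlier[valid].mean())
```

`flow_errors` returns the outlier ratio as a fraction; multiply by 100 to obtain the percentage reported in the tables.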
We compare our depth network with state-of-the-art algorithms on the test split of the KITTI raw dataset provided by Eigen et al. As shown in Table 1, our method achieves state-of-the-art performance when compared with models trained with monocular video sequences. However, our method performs slightly worse than the models that exploit calibrated stereo image pairs (i.e., pose supervision) or additional ground truth depth annotation. We believe that the performance gap can be attributed to the error induced by our pose network. Extending our approach to calibrated stereo videos is an interesting future direction.
We also conduct an ablation study by removing the forward-backward consistency loss or the cross-task consistency loss. In both cases our results show significant performance degradation, highlighting the importance of the proposed consistency losses. Figure 5 shows a qualitative comparison with [14, 73]. Our method better captures thin structures and delineates clear object contours.
To evaluate the generalization ability of our depth network on general scenes, we also apply our trained model to the Make3D dataset. Table 2 shows that our method achieves state-of-the-art performance compared with existing unsupervised models and is competitive with supervised learning models (even without fine-tuning on the Make3D dataset).
Table 2. Depth estimation results on the Make3D dataset.

| Method | Supervision | Abs Rel | Sq Rel | RMSE | log RMSE |
|---|---|---|---|---|---|
| Train set mean | - | 0.876 | 12.98 | 12.27 | 0.307 |
| Karsch et al. | depth | 0.428 | 5.079 | 8.389 | 0.149 |
| Liu et al. | depth | 0.475 | 6.562 | 10.05 | 0.165 |
| Laina et al. | depth | 0.204 | 1.840 | 5.683 | 0.084 |
| Li et al. | depth | 0.176 | - | 4.260 | 0.069 |
| Godard et al. | pose | 0.544 | 10.94 | 11.76 | 0.193 |
| Zhou et al. | none | 0.383 | 5.321 | 10.47 | 0.478 |
Table 3. Optical flow results on the KITTI flow 2012 and 2015 datasets. Values in parentheses are from models fine-tuned on the corresponding data.

| Method | Dataset | KITTI 2012 Train EPE | KITTI 2012 Test EPE | KITTI 2015 Train EPE | KITTI 2015 Train F1 | KITTI 2015 Test F1 |
|---|---|---|---|---|---|---|
| FlowNetS | C (S) | 8.26 | - | 15.44 | 52.86% | - |
| FlowNetC | C (S) | 9.35 | - | 12.52 | 47.93% | - |
| SpyNet | C (S) | 9.12 | - | 20.56 | 44.78% | - |
| SemiFlowGAN | C (S) / K (U) | 7.16 | - | 16.02 | 38.77% | - |
| FlowNet2 | C (S) + T (S) | 4.09 | - | 10.06 | 30.37% | - |
| UnsupFlownet | C (U) + K (U) | 11.3 | 9.9 | - | - | - |
| DSTFlow | C (U) | 16.98 | - | 24.30 | 52.00% | - |
| DSTFlow | K (U) | 10.43 | 12.4 | 16.79 | 36.00% | 39.00% |
| Yin et al. | K (U) | - | - | 10.81 | - | - |
| UnFlowC | SYN (U) + K (U) | 3.78 | 4.5 | 8.80 | 28.94% | 29.46% |
| Ours (w/o forward-backward) | SYN (U) + K (U) | 3.86 | 4.7 | 9.12 | 26.27% | 26.90% |
| Ours (w/o cross-task) | SYN (U) + K (U) | 4.70 | 5.8 | 8.95 | 28.37% | 30.03% |
| Ours | SYN (U) + K (U) | 3.54 | 4.4 | 8.98 | 26.01% | 25.70% |
| FlowNet2-ft-kitti | C (S) + T (S) + K (S) | (1.28) | 1.8 | (2.30) | (8.61%) | 11.48% |
| UnFlowCSS-ft-kitti | SYN (U) + K (U) + K (S) | (1.14) | 1.7 | (1.86) | (7.40%) | 11.11% |
| UnFlowC-ft-kitti | SYN (U) + K (U) + K (S) | (2.13) | 3.0 | (3.67) | (17.78%) | 24.20% |
| Ours-ft-kitti | SYN (U) + K (U) + K (S) | (1.75) | 3.0 | (2.85) | (13.47%) | 22.82% |
Table 4. Pose estimation results.

| Method | Seq. 09 | Seq. 10 |
|---|---|---|
| Zhou et al. | 0.021 ± 0.017 | 0.020 ± 0.015 |
| Mahjourian et al. | 0.013 ± 0.010 | 0.012 ± 0.011 |
| Yin et al. | 0.012 ± 0.007 | 0.012 ± 0.009 |
We compare our flow network with conventional variational algorithms, supervised CNN methods, and several unsupervised CNN models on the KITTI flow 2012 and 2015 datasets. As shown in Table 3, our method achieves state-of-the-art performance on both datasets. A visual comparison can be found in Figure 6. With optional fine-tuning on the ground truth labels available in the KITTI flow datasets, our approach achieves performance competitive with methods that share similar network architectures. This suggests that our method can serve as an unsupervised pre-training technique for learning optical flow in domains where ground truth data are scarce.
For completeness, we provide the performance evaluation of the pose network. We follow the same evaluation protocol as prior work and use a 5-frame based pose network. As shown in Table 4, our pose network shows competitive performance with respect to state-of-the-art visual SLAM methods and other unsupervised learning methods. We believe that a better pose network would further improve the performance of both depth and optical flow estimation.
We presented an unsupervised learning framework for both single-view depth prediction and optical flow estimation using unlabeled video sequences. Our key technical contribution lies in the proposed cross-task consistency that couples the network training. At test time, the trained depth and flow models can be applied independently. We validate the benefits of joint training through extensive experiments on benchmark datasets. Our single-view depth prediction model compares favorably against existing unsupervised models using unstructured videos on both KITTI and Make3D datasets. Our flow estimation model achieves competitive performance with state-of-the-art approaches. By leveraging geometric constraints, our work suggests a promising future direction of advancing the state-of-the-art in multiple dense prediction tasks using unlabeled data.
This work was supported in part by NSF under Grant No. (#1755785). We thank NVIDIA Corporation for the donation of GPUs.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR (2016)
Horn, B.K., Schunck, B.G.: Determining optical flow. Artificial Intelligence 17(1-3), 185–203 (1981)
Lai, W.S., Huang, J.B., Yang, M.H.: Semi-supervised learning for optical flow with generative adversarial networks. In: NIPS (2017)
Li, B., Shen, C., Dai, Y., van den Hengel, A., He, M.: Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In: CVPR (2015)
Zamir, A.R., Sax, A., Shen, W., Guibas, L., Malik, J., Savarese, S.: Taskonomy: Disentangling task transfer learning. In: CVPR (2018)