, video frame interpolation Bao et al. (2019), view synthesis Choi et al. (2019); Yoon et al. (2020); Liu et al. (2021), and stabilization Liu et al. (2009); Lee et al. (2021). In this paper, we propose a learning-based approach that concurrently infers scene depth and camera pose from offline videos. Our study belongs to the research track of SfM on already-acquired videos. Unlike visual SLAM Engel et al. (2014); Mur-Artal et al. (2015); et al. (2017); Teed and Deng (2021), which assumes streaming videos that must be processed online, the offline setting can exploit all frames stored in the video to obtain a globally consistent estimation.
Inferring both depth and pose for every frame is essentially a chicken-and-egg problem. Traditional solutions rely on established SfM tools (e.g., COLMAP Schonberger and Frahm (2016)) to estimate the camera trajectory and then perform multi-view stereo. However, these tools often yield incomplete depth and lack robustness because of fragile, sparse feature tracking. Especially in videos containing dynamic objects, motion blur, or large texture-less regions, the estimation process usually fails early, even in the pre-processing step.
In learning-based studies, many works Eigen et al. (2014); Ranftl et al. (2020); Miangoleh et al. (2021) focus on estimating the depth from a single image (i.e., single depth estimation). Though every single depth is visually plausible in 2D, it is determined only up to scale, or even up to a projective ambiguity Hartley and Zisserman (2003). Thus the individual depths obtained from sequential frames easily suffer from temporal and geometric inconsistency. To learn from videos, unsupervised (or self-supervised) methods Zhou et al. (2017); Yin and Shi (2018); Godard et al. (2019); Bian et al. (2019) have been proposed. They may maintain scale consistency Bian et al. (2019), but their ability to generalize the pre-trained model to the test scene is usually restricted by the training data.
To overcome these issues, test-time-training video-based SfM methods have been introduced Luo et al. (2020); Kopf et al. (2021); Lee et al. (2021), which optimize the depth and pose of a test video directly. These solutions provide a promising way to maintain depth and ego-motion coherence in an input video. Luo et al. propose CVD Luo et al. (2020) to refine the single-depth network and enhance the depth consistency of a monocular video. However, it relies on COLMAP to obtain the camera poses in advance and is thus still limited by the robustness issue of traditional SfM. Kopf et al. extend CVD to CVD2 Kopf et al. (2021) without requiring SfM tools. CVD2 performs simultaneous depth and pose optimization with the off-the-shelf single-depth network MiDaS Ranftl et al. (2020) as initialization. One issue is that the optimization is often affected by the bias of the single depths and therefore yields poor pose results. Lee et al. present Deep3D Lee et al. (2021), which learns the depth and pose of the test video to solve the video stabilization problem. However, the optimization still produces sub-optimal estimations for short-camera-baseline videos due to the lack of a single-depth prior. Moreover, CVD2 and Deep3D ensure temporal consistency only in a local range; neither of them considers global consistency, which is crucial especially in long videos.
Global consistency of video depth: We argue that current test-time-training methods lack the ability to achieve global consistency between depth and pose, which is essential for accurate, long-term SfM estimation; without it, they yield drifting outcomes for the input videos. We propose GCVD, a globally consistent test-time-training method for depth and pose estimation on offline videos. GCVD has the advantage that it achieves robust inference without relying on traditional SfM tools. Our method integrates a keyframe-based pose graph into learning to attain global consistency. Fig. 2 illustrates its pipeline. In our method, keyframes are extracted from the input video to compose a pose graph with sequential and associated non-sequential edges. The method estimates the depth and pose of the keyframes and performs pose graph optimization to achieve global consistency. Then, the depth and pose of the remaining frames can be estimated efficiently by leveraging the keyframes. The experimental results validate the strength of GCVD (Fig. 1): our approach outperforms the state-of-the-art approaches, CVD2 and Deep3D, on both depth and pose estimation (7-Scenes dataset Shotton et al. (2013)). Besides, it is significantly faster than CVD2 and achieves a good trade-off between efficiency and global consistency in contrast to Deep3D. The main characteristics of our GCVD include:
Global consistency: To tackle the challenge of joint depth and pose estimation from video collections, we introduce what is, to our knowledge, the first test-time-training method that robustly enforces global consistency.
Efficient global-pose-graph and optimizer: Our method requires merely the pose-only bundle adjustment on the keyframes, and can then leverage network learning for estimating the depths and poses for all frames efficiently.
Performance improvements and competitive speed: Our method outperforms SOTA on both depth and pose with over 19% improvement on 7-Scenes dataset Shotton et al. (2013), and also shows strong computational efficiency.
Traditional Approach: Traditional SfM Wu and others (2011); Moulon et al. (2016); Schonberger and Frahm (2016) jointly estimates the 3D structure and camera poses of multi-view images via bundle adjustment Agarwal et al. (2022); Kümmerle et al. (2011) on local features. Subsequently, multi-view stereo (MVS) Furukawa and Hernández (2015) uses the estimated poses to obtain dense depth, but the result often contains holes and noise.
In another track, traditional visual odometry (VO) and SLAM Klein and Murray (2007); Engel et al. (2014); Mur-Artal et al. (2015) usually maintain keyframes and a pose graph to perform efficient and consistent localization. Besides, bundle adjustment (BA) Triggs et al. (1999) and pose graph optimization Kümmerle et al. (2011) can be introduced to prevent drifts and enhance global consistency.
The performance of traditional methods generally relies on successful feature tracking. The extracted sparse local features Lowe (2004); Bay et al. (2006); Rublee et al. (2011) are often fragile under challenging conditions such as homogeneous areas, motion blur, and illumination changes. Hence, these methods struggle to obtain dense depth maps and are not robust enough to handle the texture-less and blurred content of videos in the wild.
Learning Depth Only (Given Pose): Supervised learning has been widely used for the ill-posed single depth estimation problem Eigen et al. (2014); Liu et al. (2015); Eigen and Fergus (2015). Apart from acquiring real depth as groundtruth, some methods Garg et al. (2016); Godard et al. (2017); Gonzalez and Kim (2021) learn the single depth from binocular pairs. Other studies leverage synthetic data Mayer et al. (2016) or pseudo groundtruth Chen et al. (2016); Li and Snavely (2018); Chen et al. (2020); Li et al. (2019b). MiDaS Ranftl et al. (2020) obtains the relative depth of stereo pairs from large-scale and diverse 3D movies. Recently, Miangoleh et al. Miangoleh et al. (2021) integrate the multi-scale depths of MiDaS to handle high-resolution images. Yin et al. Yin et al. (2021) utilize point cloud networks to resolve the perspective ambiguity of a single depth. Although the single-depth methods produce visually plausible individual depth maps, the issue of geometric inconsistency across multiple views is not addressed.
With known camera poses, learning-based multi-view stereo can estimate dense depth of multiple images. In Huang et al. (2018); Yao et al. (2018); Im et al. (2019), plane sweep algorithm is employed to estimate dense depth in a supervised manner. The methods in Long et al. (2021); Wimbauer et al. (2021) estimate video depth with known camera pose or obtain the pose via COLMAP.
Depth and/or Pose Supervision and Joint Estimation: Approaches of this type utilize groundtruth depths and/or known-pose views for training; the obtained joint depth/pose estimator is then applied to test-scene sequences of unknown poses. For depth estimation, UNet-like models are used to perform per-pixel depth regression Ummenhofer et al. (2017); et al. (2017); Zhou et al. (2018); Bloesch et al. (2018); Czarnowski et al. (2020). Deep cost volumes with a plane-sweep structure are also adopted Wei et al. (2020); Teed and Deng (2019); Wang et al. (2021). To estimate the pose, some methods directly use networks for regression Ummenhofer et al. (2017); Zhou et al. (2018); Wei et al. (2020); Teed and Deng (2021); others leverage the fundamental multi-view and epipolar geometry et al. (2017); Bloesch et al. (2018); Teed and Deng (2019); Czarnowski et al. (2020); Wang et al. (2021). However, acquiring groundtruth pose or depth in the real world is non-trivial (e.g., requiring LiDAR Saxena et al. (2008); Geiger et al. (2013) or RGB-D cameras Sturm et al. (2012); Nathan Silberman and Fergus (2012); Shotton et al. (2013)). The need to generalize to test data from unknown scenes also restricts their use in practice.
Self-Supervision for Depth and Pose: Methods in this category learn a joint depth/pose inference network in a self-supervised (or unsupervised) manner without relying on pre-given groundtruths Zhou et al. (2017); Gordon et al. (2019); Godard et al. (2019); Chen et al. (2019); Li et al. (2019a); Bian et al. (2019); Ranjan et al. (2019); Zou et al. (2020); Jiao et al. (2021); Watson et al. (2021); Zhou et al. (2021); Jung et al. (2021). They cooperatively regulate the depths and poses on monocular videos via warping-based view synthesis from the training data. Further approaches use optical flow Yin and Shi (2018); Ranjan et al. (2019); Jiao et al. (2021) and segmentation Casser et al. (2019); Ranjan et al. (2019) to improve the performance and handle dynamic objects. Yet, in test time, most networks still perform single-depth estimation and thus yield incoherent results. The methods also suffer from the generalization issue due to the domain gap between training and testing data.
Test-Time Training Approach: To avoid the difficulty of applying an inference model learned in a given environment to an unseen environment, test-time-training methods have been proposed more recently, making the solutions better suited for practical use. Although test-time training is slower than pure inference, the approaches meet the requirements of offline videos, which need no real-time processing. CVD Luo et al. (2020) is the first attempt at test-time training to obtain geometrically consistent depth and pose estimation of a video. However, CVD relies on the SfM tool COLMAP, which suffers from fragile and sparse reconstruction, and its computation is slow. Deep3D Lee et al. (2021) uses a network-based optimization framework to learn video depth and pose for video stabilization, but the depth performance is restricted due to the lack of a single-depth prior. CVD2 Kopf et al. (2021) applies a deformable mesh to the single depth, together with a post-processing filter, and shows promising depth results; however, the pose performance still suffers from the bias of the single depth, as no fine-tuning mechanism is provided for the depth net. Besides, CVD2 requires a tremendous amount of optimization time. Its quantitative performance is evaluated only on videos of 50 frames Wulff et al. (2012).
Current works ensure only local consistency with nearby frames, whereas global consistency is not tackled. We introduce GCVD, the first test-time-training method with global consistency, which obtains more accurate and robust results. Owing to the global consistency, GCVD scales to long videos. In our experiments, 1000-frame videos are used to validate the performance.
Given a video, our goal is to estimate a dense depth map and a camera pose for every frame. Our CNN-based optimization framework utilizes learning-based SfM to jointly estimate the depth and pose of each frame. Nevertheless, applying SfM to a video may face various challenges, such as small camera-baseline motion and a lack of co-visible scenes among frames.
Proper baseline arrangement: Current test-time-training solutions (e.g., Deep3D Lee et al. (2021) and CVD2 Kopf et al. (2021)) simply take near-to-distant neighbor frames at empirically selected frame intervals (e.g., 1, 2, 4, 8) to ensure a coherent baseline. This strategy does not take the real disparities between the frames into account and does not guarantee proper baselines among the frames for SfM. In our work, we leverage the recent progress in deep optical flow estimation and introduce dense-flow-guided keyframes to perform the initial optimization with adequate baseline motions. The pipeline (Fig. 2) is introduced in the sections Pre-processing and Pose Graph Generation and Keyframe-based 3D Optimization, and the network-based optimization module is elaborated in Joint Optimization of Depth and Pose.
Pre-processing and Pose Graph Generation
Videos often contain motion blur and large view-direction changes, which cause traditional sparse features to fail at matching. We use learning-based optical flow among video frames to overcome this difficulty and obtain dense flow maps. However, in contrast to the existing approaches CVD2 and Deep3D, which need to compute flows from a target frame to many other frames, our GCVD initially estimates the depth and pose of keyframes with reliable camera baselines and enforces global optimization later; thus, we need only the flow between adjacent views, which reduces the computational burden.
Besides, to prevent dynamic objects from disturbing pose estimation, semantic segmentation (e.g., Mask-RCNN in Detectron2 Wu et al. (2019)) is used to obtain a binary mask that filters out the likely-dynamic pixels in each frame. We also take the single depths of MiDaS Ranftl et al. (2020) as the depth prior to regularize the optimization.
The ideas of keyframes and pose graphs have been widely used in traditional SLAM Klein and Murray (2007); Engel et al. (2014); Mur-Artal et al. (2015) and SfM Schonberger and Frahm (2016); Barath et al. (2021) to reduce the computational complexity and ensure global consistency for large-scale reconstruction. Our method also constructs a pose graph. Nevertheless, unlike traditional SLAM or SfM, we use a learning-based approach to provide better robustness in the optimization. The pose graph has the keyframes as its vertices, and its edges include the sequential edges and the non-sequential co-visible edges, as described below.
Dense-flow-guided keyframe decision. To sample representative keyframes from videos, traditional solutions rely on sparse feature tracking while long feature tracks are challenging to obtain. Instead, the dense optical flow acquired in the pre-processing can provide a reliable reference for keyframe decision. Thus, the frame is chosen as a keyframe if the accumulated flow magnitude from the last selected keyframe exceeds a movement threshold .
where we set and only use the static regions for evaluation. Then, the flow of adjacent keyframes, , are established for the keyframe optimization. As we use the accumulated adjacent flows to pick just-needed frames, compared to uniform selection, a more compact and exemplary keyframe set can be built. After selecting keyframes, the sequential edges are formed by connecting the keyframes of nearby indices within a subset of in the pose graph to ensure local consistency. For those keyframe pairs with the index differences exceeding , we will consider their co-visibility of shared scenes to form the further edges of non-sequential co-visible views to enforce better the global consistency.
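The keyframe decision and the sequential edges described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: it assumes the mean flow magnitude over static pixels between adjacent frames has already been computed, and the function names, the input format, and the `offsets` list of index differences are hypothetical.

```python
def select_keyframes(adjacent_flow_mag, threshold):
    """Pick keyframes by accumulated adjacent-flow magnitude.

    adjacent_flow_mag[t] is the mean flow magnitude (static pixels only)
    between frame t and frame t + 1. Frame 0 is always a keyframe; a later
    frame becomes a keyframe once the magnitude accumulated since the last
    keyframe exceeds `threshold` (an assumed, user-chosen value).
    """
    keyframes = [0]
    accumulated = 0.0
    for t, magnitude in enumerate(adjacent_flow_mag):
        accumulated += magnitude
        if accumulated > threshold:
            keyframes.append(t + 1)
            accumulated = 0.0  # restart accumulation from the new keyframe
    return keyframes


def sequential_edges(num_keyframes, offsets):
    """Connect keyframe i to keyframe i + d for every index offset d."""
    return [(i, i + d)
            for i in range(num_keyframes)
            for d in offsets
            if i + d < num_keyframes]


if __name__ == "__main__":
    print(select_keyframes([1.0] * 6, threshold=2.5))
    print(sequential_edges(4, offsets=[1, 2]))
```

Compared with uniform sampling, the accumulation adapts the keyframe spacing to the camera motion: slow segments contribute few keyframes and fast segments contribute many.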
CNN-based keyframe association.
The keyframe pairs with high image similarities are picked and geometrically verified to serve as non-sequential co-visible edges. We leverage the deep features of the keyframes and compute their pairwise cosine similarities to form a similarity matrix. For each keyframe, the feature is extracted by an ImageNet-pretrained ResNet encoder Godard et al. (2019). The output feature is then passed through a global average pooling layer and normalization. Thus, the similarity matrix can be computed simply as the inner product of the normalized feature vectors. Then, the associated pairs are sampled from the similarity matrix by a similarity threshold and max-pooling for reducing redundant pairs.
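The similarity computation above can be illustrated with a small dependency-free sketch. It assumes that each keyframe has already been reduced to a pooled feature vector; the function names, the threshold, and the minimum index gap are hypothetical, and the max-pooling step that removes redundant pairs is omitted for brevity.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def similarity_matrix(features):
    """Cosine similarity of all keyframe pairs via inner products."""
    unit = [l2_normalize(f) for f in features]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def candidate_pairs(sim, sim_threshold, min_index_gap):
    """Return non-sequential pairs (i, j), i < j, above the threshold."""
    n = len(sim)
    return [(i, j)
            for i in range(n)
            for j in range(i + min_index_gap, n)
            if sim[i][j] > sim_threshold]


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]  # toy pooled features
    sim = similarity_matrix(feats)
    print(candidate_pairs(sim, sim_threshold=0.9, min_index_gap=2))
```

The candidates returned here would still have to pass the geometric verification before becoming edges of the pose graph.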
Besides, geometric verification is necessary to filter out noisy associated keyframe pairs. Traditional verification uses the inlier ratio of a fundamental matrix or homography estimated via SIFT Lowe (2004) and RANSAC Fischler and Bolles (1981), but this is not reliable enough. We additionally examine the forward-backward consistency of the dense optical flow of each associated pair. Moreover, to guarantee adequate co-visible areas for optimization, an associated pair is removed if its average flow magnitude exceeds the movement threshold of the keyframe decision.
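A forward-backward flow check of this kind might look like the sketch below. It is an assumption-laden illustration: flows are stored as nested lists of `(dx, dy)` tuples, the backward flow is looked up at the nearest pixel rather than bilinearly, and the tolerance, the minimum consistent ratio, and the function names are all hypothetical.

```python
import math


def forward_backward_mask(flow_fw, flow_bw, tol=1.0):
    """Mark pixels whose forward flow is undone by the backward flow.

    flow[y][x] = (dx, dy). A pixel is consistent when it lands inside the
    image and forward + backward displacement is shorter than `tol` pixels.
    """
    h, w = len(flow_fw), len(flow_fw[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow_fw[y][x]
            tx, ty = round(x + dx), round(y + dy)  # nearest target pixel
            if 0 <= tx < w and 0 <= ty < h:
                bx, by = flow_bw[ty][tx]
                mask[y][x] = math.hypot(dx + bx, dy + by) < tol
    return mask


def verify_pair(flow_fw, flow_bw, min_consistent_ratio, max_mean_magnitude):
    """Accept a pair with enough consistent pixels and moderate motion."""
    mask = [m for row in forward_backward_mask(flow_fw, flow_bw) for m in row]
    mags = [math.hypot(dx, dy) for row in flow_fw for dx, dy in row]
    ratio = sum(mask) / len(mask)
    mean_magnitude = sum(mags) / len(mags)
    return ratio >= min_consistent_ratio and mean_magnitude <= max_mean_magnitude


if __name__ == "__main__":
    fw = [[(1, 0), (1, 0), (1, 0)]]
    bw = [[(-1, 0), (-1, 0), (-1, 0)]]
    print(forward_backward_mask(fw, bw), verify_pair(fw, bw, 0.5, 2.0))
```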
Keyframe-based 3D Optimization
The proposed pipeline first optimizes the keyframes, which have suitable camera movement, to obtain robust initial estimates. Then, pose graph optimization is utilized to retain the global consistency of the keyframes' pose estimations. Finally, the depth and pose of the remaining frames are estimated efficiently according to the obtained pose graph. In the following, we give an overview of the procedures. Details of our joint depth-and-pose deep network model are described in Joint Optimization of Depth and Pose.
Keyframe 3D estimation via deep model. For each keyframe, the depth and pose are learned with multiple nearby keyframes (i.e., the sequential edges in the pose graph) with descending weights to acquire consistent results. Besides, a mini-batch of keyframes is optimized simultaneously on a GPU, and consecutive mini-batches overlap by a fixed interval to ensure coherent solutions Lee et al. (2021). In our work, we set . The relative poses obtained for the sequential edges are then recorded for the subsequent pose-only bundle adjustment. Likewise, we optimize the non-sequential co-visible keyframe pairs (without decreasing weights) and obtain their relative poses accordingly. Hence, each edge of the pose graph is set up with an initial relative-pose transformation.
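The overlapping mini-batch schedule can be sketched as below. The function name and the `stride` parameter (the step between the starts of consecutive batches) are hypothetical; the batch size and stride in the example are illustrative values only.

```python
def overlapping_batches(num_frames, batch_size, stride):
    """Split frame indices into consecutive batches that share frames.

    With stride < batch_size, each batch overlaps the previous one by
    batch_size - stride frames, so adjacent fragments stay coherent.
    """
    if not 0 < stride <= batch_size:
        raise ValueError("stride must be in (0, batch_size]")
    if num_frames <= 0:
        return []
    batches = []
    start = 0
    while True:
        end = min(start + batch_size, num_frames)
        batches.append(list(range(start, end)))
        if end == num_frames:  # last batch reaches the end of the sequence
            break
        start += stride
    return batches


if __name__ == "__main__":
    for batch in overlapping_batches(10, batch_size=4, stride=2):
        print(batch)
```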
Global pose graph optimization with efficiency. Pose graph optimization Kümmerle et al. (2011) is used to refine the pose estimations of the keyframes from the above initialization. Note that, for efficiency, we perform pose-only bundle adjustment rather than joint depth-and-pose bundle adjustment. Instead, the depths of the keyframes are fine-tuned in the next step by leveraging the learning models. The heavy bundle-adjustment overhead is thus shared with the deep networks, which cooperate to yield a more efficient optimization.
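To convey the intuition of pose graph optimization, the toy example below distributes accumulated drift over a chain of poses once a non-sequential edge closes the loop. It is deliberately simplified and is not the optimizer used in the paper: poses are 1D positions instead of 6-DoF transformations, all edges have equal weight, and plain coordinate descent replaces the g2o solver.

```python
def optimize_pose_graph_1d(initial, edges, iterations=200):
    """Least-squares refinement of 1D poses with node 0 held fixed.

    Each edge (i, j, z) states that x[j] - x[i] should equal z. Every sweep
    sets each free node to the mean of the positions its edges predict,
    which is the exact minimizer of the squared error in that coordinate.
    """
    x = list(initial)
    for _ in range(iterations):
        for k in range(1, len(x)):
            predictions = []
            for i, j, z in edges:
                if j == k:
                    predictions.append(x[i] + z)
                elif i == k:
                    predictions.append(x[j] - z)
            if predictions:
                x[k] = sum(predictions) / len(predictions)
    return x


if __name__ == "__main__":
    # Sequential edges over-estimate each step (1.1 instead of 1.0);
    # one non-sequential edge measures the total displacement as 3.0.
    edges = [(0, 1, 1.1), (1, 2, 1.1), (2, 3, 1.1), (0, 3, 3.0)]
    dead_reckoning = [0.0, 1.1, 2.2, 3.3]
    print(optimize_pose_graph_1d(dead_reckoning, edges))
```

In this example the 0.3 of accumulated drift is spread over all four edges, pulling the last pose from 3.3 toward the loop-closure measurement.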
Non-keyframe 3D optimization and keyframe depth refinement. In addition to fine-tuning the keyframe depths, the remaining non-keyframes are optimized via the deep network over fewer iterations, with the keyframe poses kept fixed. Likewise, each frame is optimized with multiple nearby views. We then obtain the depth and pose of the entire video.
Joint Optimization of Depth and Pose
In this section, we introduce the optimization module for 3D estimation. Our approach leverages deep networks to store the scene information in the model weights, which facilitates the optimization of sequential fragments. The module takes a set of frame pairs as input and estimates the depths and poses simultaneously. For simplicity, we describe the module with a single pair of images as input; more pairs simply use the sum of the respective losses. The networks learn both depth and pose, with the objectives designed below.
Depth and pose components. As shown in Fig. 3, the depth and pose components estimate the individual depth and the 6-DoF global pose, respectively, associated with an input RGB frame. The depth component exploits a MiDaS-pretrained network Ranftl et al. (2020) with an additional layer normalization to stabilize the output scale of the depth estimation. Then, a learnable mesh deformation Kopf et al. (2021) is adopted to achieve better alignment among sequential depths. Unlike CVD2 Kopf et al. (2021), which directly takes fixed single depths as initialization, we use a trainable depth network that can refine the bias of the initial depth to encourage better estimations. For time and memory efficiency, however, only the last two convolutional layers of the MiDaS network are optimized, which allows a larger span of frames to be learned at a time.
Similar to the depth component, the pose encoder is frozen with ImageNet-pretrained weights and only the decoder can be optimized. The information saved in the learned weights of the decoder helps speed up the convergence of the next optimization. Moreover, a positional encoding Vaswani et al. (2017), which encodes the chronological order of the entire sequence, is added to the feature map to enhance the learning for the sequence.
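For reference, the sinusoidal positional encoding of Vaswani et al. (2017) for a frame index can be computed as in the sketch below. How the encoding is broadcast over and added to the pose feature map is not specified in this text, so only the per-frame vector is shown; the function name and the dimension used in the example are assumptions.

```python
import math


def positional_encoding(position, dim):
    """Sinusoidal encoding of a frame index as a `dim`-dimensional vector.

    Even entries hold sin(position / 10000^(2k/dim)) and odd entries the
    matching cosine, following the standard Transformer formulation.
    """
    if dim % 2:
        raise ValueError("dim must be even")
    encoding = []
    for i in range(0, dim, 2):  # i equals 2k for the k-th sin/cos pair
        angle = position / (10000 ** (i / dim))
        encoding.extend([math.sin(angle), math.cos(angle)])
    return encoding


if __name__ == "__main__":
    print(positional_encoding(0, 4))
    print(positional_encoding(5, 4))
```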
|Method||Known pose?||Depth Metrics||Pose Metrics|
| || ||AbsRel||SqRel||RMSE|| ||ATE||RPE Trans||RPE Rot|
|DPSNet Im et al. (2019)||✓||0.199||0.142||0.438||0.710||-||-||-|
|CNMNet Long et al. (2020)||✓||0.161||0.083||0.361||0.766||-||-||-|
|NeuralRecon Sun et al. (2021)||✓||0.155||0.104||0.347||0.820||-||-||-|
|DeepV2D Teed and Deng (2019)|| ||0.162||0.092||0.380||0.767||0.471||1.018||60.979|
|DROID-SLAM Teed and Deng (2021)|| ||0.209||0.132||0.462||0.665||0.463||0.928||40.143|
|Deep3D Lee et al. (2021)|| ||0.172||0.105||0.406||0.748||0.310||0.306||8.665|
|CVD2 Kopf et al. (2021)|| ||0.154||0.085||0.379||0.795||0.375||0.517||31.102|
Objectives. The proposed objectives include inter-frame constraints for geometrical consistency and intra-frame regularization. The inter-frame objectives ensure consistent estimations between two views. The test-time learner exploits the point transformation between via 3D projection as formulated below:
where and denote the homogeneous form of a pixel in and , respectively. stands for the camera intrinsics. Accordingly, the rigid flow is used to realize the inter-frame constraints in the following three aspects.
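The rigid flow induced by depth and relative pose can be sketched for a single pixel as follows. This is a minimal illustration assuming a zero-skew pinhole camera; the function names and the argument layout (3x3 nested lists for the intrinsics and rotation, a 3-vector for the translation) are hypothetical.

```python
def mat_vec(matrix, vector):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return [sum(matrix[r][c] * vector[c] for c in range(3)) for r in range(3)]


def rigid_flow(pixel, depth, intrinsics, rotation, translation):
    """Flow of one pixel caused purely by camera motion.

    The pixel is back-projected with its depth, moved by the relative pose
    (rotation, translation) into the other camera frame, and re-projected.
    """
    fx, fy = intrinsics[0][0], intrinsics[1][1]
    cx, cy = intrinsics[0][2], intrinsics[1][2]
    u, v = pixel
    point = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    moved = [p + t for p, t in zip(mat_vec(rotation, point), translation)]
    u2 = fx * moved[0] / moved[2] + cx
    v2 = fy * moved[1] / moved[2] + cy
    return (u2 - u, v2 - v)


if __name__ == "__main__":
    K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(rigid_flow((320.0, 240.0), 2.0, K, identity, [0.1, 0.0, 0.0]))
```

A sideways translation of 0.1 at depth 2 with a focal length of 500 pixels moves the principal point by 500 * 0.1 / 2 = 25 pixels, which illustrates why closer points produce larger rigid flow.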
The photometric loss computes the appearance bias between and the synthesized (warped using ) by and structure dissimilarity Wang et al. (2004) losses:
where denotes the valid points projected from onto the image plane of and excluding likely-dynamic pixels by .
The optical flow loss measures the displacement error in the image space. Hence, the flow generated in the pre-processing is used as the supervision for the rigid flow . We examine the forward-backward consistency between the pre-processing flows and to form a binary mask , and let .
The depth consistency loss Bian et al. (2019) assesses the inconsistency between individually estimated depth and .
where is the transformed depth of using , .
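As a hedged illustration, the normalized depth difference used by Bian et al. (2019) can be written as below. The exact equation of this paper is not reproduced in the text, so the sketch follows the cited formulation; depths are flat lists of per-pixel values and the names are hypothetical.

```python
def depth_consistency_loss(transformed_depth, sampled_depth, valid):
    """Mean of |a - b| / (a + b) over valid pixels.

    `transformed_depth` holds depths of one frame transformed into the
    other view, `sampled_depth` the depths estimated for that view at the
    projected locations, and `valid` flags usable pixels. The ratio keeps
    every term in [0, 1) regardless of the absolute depth scale.
    """
    terms = [abs(a - b) / (a + b)
             for a, b, ok in zip(transformed_depth, sampled_depth, valid)
             if ok]
    return sum(terms) / len(terms) if terms else 0.0


if __name__ == "__main__":
    print(depth_consistency_loss([2.0, 1.0, 3.0], [2.0, 3.0, 1.0],
                                 [True, True, False]))
```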
Apart from inter-frame losses, the intra-frame objectives are utilized to regularize each depth , including the dynamic areas. We conduct the depth gradient loss to preserve the depth prior from pre-computed MiDaS depth . The optimization can refine the bias of initial single depth. Thus, the single depth obtained in the pre-processing is exploited to provide the supervision of depth edge (Fig. 4). Let denote the 2D gradient vector of pixel in the downsampled depth map . We measure the orientation difference of depth gradients to avoid scale difference between and .
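One possible way to compare gradient orientations independently of scale is a cosine-based penalty, sketched below. This specific form is an assumption made for illustration, not necessarily the loss used in the paper; gradients are lists of `(gx, gy)` tuples, and the function name and `eps` are hypothetical.

```python
import math


def gradient_orientation_loss(grad_pred, grad_prior, eps=1e-8):
    """Mean of 1 - cos(angle) between corresponding gradient vectors.

    Dividing by both norms removes the gradient magnitudes, so rescaling
    either depth map leaves the loss unchanged.
    """
    terms = []
    for (gx, gy), (hx, hy) in zip(grad_pred, grad_prior):
        norm = math.hypot(gx, gy) * math.hypot(hx, hy)
        cosine = (gx * hx + gy * hy) / (norm + eps)
        terms.append(1.0 - cosine)
    return sum(terms) / len(terms)


if __name__ == "__main__":
    same_direction = gradient_orientation_loss([(1.0, 0.0)], [(5.0, 0.0)])
    orthogonal = gradient_orientation_loss([(1.0, 0.0)], [(0.0, 3.0)])
    print(same_direction, orthogonal)
```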
The regularization of deformation proposed by Kopf et al. Kopf et al. (2021) maintains the spatial smoothness of learnable mesh for a flexible deformation. Likewise, we use the regularization loss to encourage smoothness in dynamic area . Finally, the total loss of a pair of is conducted as:
where the inter-frame objectives compute the bidirectional losses and the intra-frame objectives sum up the losses of individual frames. The weights , , , , are set as , respectively. Note that is used only when and are adjacent.
The approach is implemented in PyTorch with Adam and the g2o library Kümmerle et al. (2011). Following CVD2 Kopf et al. (2021), the resolutions of the depth and the deformation mesh are 384 and 17, respectively, for the longer side of the frame. An RTX 3090 GPU is used with a mini-batch size of 40. We run the optimizations of sequential keyframes, non-sequential keyframes, and non-keyframes for 300, 100, and 100 iterations, respectively. We further apply a flow-guided depth filter like that of CVD2 Kopf et al. (2021) as post-processing. The optical flow method of Teed and Deng (2020) is used. More details are given in the appendix.
We compare the proposed method with the SOTA test-time-training methods, CVD2 Kopf et al. (2021) and Deep3D Lee et al. (2021). CVD2 jointly optimizes pose and learnable deformation from initial MiDaS Ranftl et al. (2020) depths. Deep3D takes DepthNet and PoseNet Godard et al. (2019) and learns from ImageNet pretrained weight to acquire depth and pose. For fair comparisons, we assume an ideal camera intrinsic and re-implement Deep3D with the same resolution of depth, optical flow estimation Teed and Deng (2020) and static masks as CVD2 and ours. In addition, we compare our approach with the SOTA supervised SLAM systems Teed and Deng (2019, 2021), where the SLAM mode of DeepV2D Teed and Deng (2019) performs pose optimization in a tracking window and DROID-SLAM Teed and Deng (2021) utilizes dense bundle adjustment to achieve global consistency.
Datasets. In contrast to CVD2, which conducts evaluations on synthetic video clips (each only 50 frames long) of the Sintel dataset Wulff et al. (2012), we conduct the experiments on long sequences (500 to 3000 frames) of real-world datasets.
7-Scenes RGB-D dataset Shotton et al. (2013) has 46 sequences with either 500 or 1000 frames. The indoor scenes are captured with a Kinect camera at a resolution of 640×480.
TUM RGB-D dataset Sturm et al. (2012) is captured by a handheld Kinect camera and contains more demanding cases such as texture-less areas and abrupt motions. Seven representative sequences (613 to 2965 frames) in TUM RGB-D are used for evaluation.
EuRoC dataset Burri et al. (2016) has 11 sequences (1710 to 3682 frames) filmed by a micro aerial vehicle. We present the comparisons in our appendix.
Evaluation metrics. We follow the standard depth evaluation Eigen et al. (2014) to align the scales between the estimated and groundtruth depths by median scaling. The pose evaluation uses the metric of visual odometry Sturm et al. (2012); Zhang and Scaramuzza (2018), including absolute trajectory error (ATE) and relative pose error (RPE) with 7-DoF alignment.
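The depth and trajectory metrics can be illustrated with the short sketch below. It assumes flat lists of valid depth values and, for ATE, trajectories that have already been aligned (the 7-DoF alignment itself is not shown); the function names are hypothetical.

```python
import math
from statistics import median


def median_scale(pred, gt):
    """Rescale predicted depths so their median matches the groundtruth."""
    scale = median(gt) / median(pred)
    return [scale * p for p in pred]


def abs_rel(pred, gt):
    """Mean absolute relative depth error."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def ate_rmse(est_positions, gt_positions):
    """Root-mean-square position error of an already aligned trajectory."""
    squared = [sum((a - b) ** 2 for a, b in zip(e, g))
               for e, g in zip(est_positions, gt_positions)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    gt_depth = [2.0, 4.0, 6.0]
    scaled = median_scale([1.0, 2.0, 4.0], gt_depth)
    print(scaled, abs_rel(scaled, gt_depth))
    print(ate_rmse([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 1)]))
```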
|Sequence||Pose Error (ATE in meters)||Depth Error (Abs Rel)|
Evaluation on 7-Scenes Shotton et al. (2013)
Table 1 shows the quantitative comparisons of our approach with the SOTA methods. Although DeepV2D Teed and Deng (2019) and DROID-SLAM Teed and Deng (2021) utilize a pose graph for pose refinement and bundle adjustment at test time, they are restricted by the generalization capability of supervised learning across datasets. The test-time-training approach Deep3D Lee et al. (2021) shows the second-best pose performance; however, it yields worse depth estimation. CVD2 Kopf et al. (2021) retains more depth priors from MiDaS, but the bias in the depth prior leads to poor pose estimation. Besides, we compare our GCVD with other depth-estimation approaches that use known camera poses. The quantitative depth scores of DPSNet Im et al. (2019), CNMNet Long et al. (2020), and NeuralRecon Sun et al. (2021) are provided by Sun et al. (2021). Similar to DeepV2D and DROID-SLAM, the supervised methods with known poses may still suffer from the domain discrepancy between training and test data. In contrast, the proposed GCVD achieves globally consistent optimization on the test data and demonstrates the best performance on both standard depth and pose metrics.
We conduct visual comparisons by displaying the back-projected point clouds and the pose estimations together with the groundtruth trajectory after 7-DoF alignment Zhang and Scaramuzza (2018). As shown in Fig. 5, although Deep3D produces some promising camera estimations, it is prone to collapse in the depth estimation. CVD2 provides plausible depths with the aid of MiDaS but yields weak pose results due to the lack of global consistency. Again, our method shows better results in the 3D visualization.
Evaluation on TUM-RGBD Sturm et al. (2012)
In this experiment, we compare our GCVD with the test-time-training approaches (CVD2 and Deep3D) and also with COLMAP, a representative traditional solution that maintains global consistency. Table 2 provides the per-sequence quantitative errors. We present only the pose results of COLMAP since its dense depths via multi-view stereo still contain holes and noise. Although COLMAP shows superior pose results, it completely fails (collapses) on one sequence (marked as red in Table 2) due to the fragile sparse reconstruction. Among the test-time-training methods, ours attains the lowest depth and pose errors overall. We handle the depth prior properly with the learnable networks to facilitate pose learning. Thus, our approach shows better pose estimation on most sequences than CVD2, which accumulates more pose errors on long sequences. Our GCVD can refine the pose estimation over long sequences to maintain global consistency. On the other hand, we find that the weak pose performance on some sequences is caused by unreliable pre-computed optical flow, which is discussed in the limitations of our approach in the appendix.
Besides, we compare the geometric consistency by visualizing the alignment of point clouds from different views. Two distant frames viewing a common scene are back-projected into 3D point clouds. Hence, geometric inconsistency can be seen as misalignment between the two point clouds in the co-visible area. As shown in Fig. 6, Deep3D and ours show point cloud alignments similar to that of the groundtruth because both approaches have depth fine-tuning mechanisms. In contrast, the accumulated depth bias leads CVD2 to yield poor pose estimation and large misalignment between the views.
We conduct ablation studies on the 7-Scenes dataset Shotton et al. (2013) in Table 3. For the depth component, the inserted layer normalization stabilizes the depth scale of the MiDaS Ranftl et al. (2020) network and hence improves the depth and pose performance by 35% and 40%, respectively. The flexible deformation Kopf et al. (2021) corrects the spatial misalignments of each depth and improves depth estimation by 4%. Besides, the depth gradient loss retains the initial depth priors while the depth bias is refined. The well-handled depth prior encourages a better joint optimization, thus reducing the pose error by 15% (0.327 to 0.277). Moreover, removing the keyframe strategy raises the pose error by 11% (0.277 to 0.308) due to the lack of proper camera baselines for the initial optimization. Finally, the pose graph optimization for global consistency further cuts the pose error by 10% (0.277 to 0.249).
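The percentages quoted with explicit error values are relative changes with respect to the first value of each pair, which the following short check reproduces (the helper name is hypothetical):

```python
def relative_change(before, after):
    """Signed change from `before` to `after`, in percent of `before`."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    # (before, after) pose errors quoted in the ablation discussion.
    for before, after in [(0.327, 0.277), (0.277, 0.308), (0.277, 0.249)]:
        print(f"{before} -> {after}: {relative_change(before, after):+.1f}%")
```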
We compare the runtime of our GCVD with the test-time-training methods. We select five long videos and extract the first frames of the videos to compose sequences of different lengths for runtime evaluation. The execution times are measured on an i7-11700K with an RTX 3090 GPU. We present the averaged per-frame time in Table 4. Note that we do not compare with COLMAP, which is extremely time-consuming (e.g., 56 seconds per frame on a 2000-frame video). CVD2 is about 6 to 9 times slower than our method due to the preparation of multiple pairs of optical flow and its traditional CPU-based optimizer. Deep3D is highly efficient on long videos; however, it tends to drift and to collapse in depth estimation. In contrast, our method is the fastest on short videos because it learns a few keyframes first and then optimizes the non-keyframes with fewer iterations. For long videos, our GCVD is slightly slower than Deep3D due to the keyframe association over more keyframes for global consistency.
We present GCVD, a learning-based method for video depth and pose estimation with global consistency and efficiency. To our knowledge, this is the first study tackling global consistency for test-time training. Based on the global poses of keyframes from the pose-only bundle adjustment, the deep networks efficiently and jointly learn the keyframe depth refinement and the depth and pose of the remaining frames. In addition, our method handles the single-depth prior properly and fine-tunes the depth network to alleviate depth bias and achieve robust and consistent 3D estimation. Experimental results show that GCVD outperforms the state-of-the-art approaches on both depth and pose evaluation. Moreover, GCVD achieves high efficiency by keeping the scene knowledge in the network weights to speed up the optimization of the next fragment of frames. We will release our code to the public. In contrast to COLMAP, which uses traditional techniques, our GCVD is a fundamental deep-learning tool for offline-video SfM.
Appendix A Appendix
We present GCVD, a test-time training method for video-based 3D estimation on offline videos. There are still few test-time training studies for video-based SfM, and existing approaches achieve only local consistency for video depth estimation. Our approach is the first globally consistent solution in this direction. It needs only a light-weight pose-only bundle adjustment for initialization, and then takes advantage of neural network learning to optimize poses and depths globally and simultaneously. Our GCVD can serve as a generally useful tool for offline video-based SfM (like the renowned tool COLMAP, which uses a traditional approach), providing dense 3D estimations instead of fragile or sparse 3D outputs under challenging conditions such as homogeneous areas and motion blur. Compared to the representative test-time-training approaches (such as CVD2 Kopf et al. (2021)), our approach can handle long videos at a reasonable runtime. Unlike the approach of Kopf et al. (2021), which only validates the performance using 50-frame video clips, we validate the performance on videos of 500 to 3682 frames, which brings the validation considerably closer to practically useful situations.
In this appendix, we present additional details to complement our main paper, including implementation details, comparisons with the state-of-the-art SLAM approaches, evaluation on EuRoC dataset Burri et al. (2016), runtime analysis, and limitations of our GCVD.
Keyframe-based pose graph optimization. We perform pose graph optimization with g2o Kümmerle et al. (2011) for globally consistent pose estimation. The edges of the pose graph include sequential and non-sequential edges. Each sequential edge connects a keyframe pair with its relative pose and weight matrix to ensure temporal coherence in a local range. The sequential edges connect relatively nearby views determined by the optical flows. However, farther views may also share co-visible scenes. Hence, we also establish non-sequential edges via keyframe association, which connects keyframe pairs whose index difference exceeds a threshold but which still share a co-visible scene, to enhance global consistency. Similarly, the non-sequential co-visible edges are constructed with the optimized relative poses and identity weight matrices. The pose graph optimization runs for at most 100 iterations to obtain the global pose estimation. Afterward, the depth and pose of the remaining frames and the depth of the keyframes are estimated simultaneously with the keyframe poses frozen to maintain global consistency. Figure 7 demonstrates the effectiveness of the pose graph optimization for achieving global consistency.
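The effect of a non-sequential edge can be illustrated with a deliberately simplified, translation-only pose graph solved by linear least squares. This sketch does not use the g2o API and ignores rotations; the function `optimize_translation_graph`, the four-keyframe loop, and the biased measurements are all assumptions made for illustration.

```python
import numpy as np

def optimize_translation_graph(num_nodes, edges):
    """Solve min sum_k w_k * ||x_j - x_i - z_ij||^2 with node 0 held at the origin.

    edges: list of (i, j, measured offset x_j - x_i, scalar weight).
    Returns an array of shape (num_nodes, dim).
    """
    dim = len(edges[0][2])
    rows, rhs = [], []
    for i, j, z, w in edges:
        row = np.zeros(num_nodes)
        row[i], row[j] = -1.0, 1.0
        rows.append(np.sqrt(w) * row)
        rhs.append(np.sqrt(w) * np.asarray(z, dtype=float))
    anchor = np.zeros(num_nodes)                 # gauge constraint: x_0 = 0
    anchor[0] = 1.0
    rows.append(anchor)
    rhs.append(np.zeros(dim))
    positions, *_ = np.linalg.lstsq(np.vstack(rows), np.vstack(rhs), rcond=None)
    return positions

# Four keyframes on a unit square; every sequential step carries a +0.1 bias in x.
sequential = [(0, 1, (1.1, 0.0), 1.0), (1, 2, (0.1, 1.0), 1.0), (2, 3, (-0.9, 0.0), 1.0)]
# Non-sequential edge between the co-visible keyframes 0 and 3.
non_sequential = [(0, 3, (0.0, 1.0), 1.0)]

chain_only = optimize_translation_graph(4, sequential)
with_loop = optimize_translation_graph(4, sequential + non_sequential)

truth_last = np.array([0.0, 1.0])
drift_chain = np.linalg.norm(chain_only[3] - truth_last)
drift_loop = np.linalg.norm(with_loop[3] - truth_last)
```

With sequential edges alone the bias accumulates along the chain; adding the non-sequential edge spreads the inconsistency over the whole loop and reduces the error of the last keyframe.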
Details of joint depth and pose optimization. The test-time optimization framework is implemented in PyTorch with Adam (with , ). The learning rates for sequential keyframes, non-sequential keyframes, and non-keyframes are , , and , respectively. To speed up the optimization, we compute the loss at a quarter scale of the depth estimation in the keyframe optimization. The depth of keyframes is then further refined in the non-keyframe optimization at the original scale (i.e., 384 for the longer side of the frame).
Post-processing. We follow the flow-guided depth filter proposed in CVD2 Kopf et al. (2021) to further enhance the temporal consistency of edge details in depth maps. The final filtered depth acquires the depth details from neighboring depths () with the chaining optical flow .
where is the projected depth of by transforming with , and the chaining flow to align the pixel coordinate. The maximum span is set as 4 by default, and the weight term considers both the depth reprojection error and the forward-backward inconsistent error of the chaining flow as follows:
where denotes the forward-backward inconsistency between the chaining flow and . The and balance the strength of the temporal depth filter. In the end, the final consistent depth and pose of the entire video is accomplished.
|Dataset||Method||Depth Metrics||Pose Metrics|
|AbsRel||SqRel||RMSE||ATE||RPE Trans||RPE Rot|
|7-Scenes||DeepV2D Teed and Deng (2019)||0.162||0.092||0.380||0.767||0.471||1.018||60.979|
|DROID-SLAM Teed and Deng (2021)||0.209||0.132||0.462||0.665||0.463||0.928||40.143|
|TUM RGBD||DeepV2D Teed and Deng (2019)||0.166||0.153||0.648||0.745||0.460||1.360||60.479|
|DROID-SLAM Teed and Deng (2021)||0.214||0.211||0.778||0.639||0.013||1.327||50.794|
Comparison with SOTA learning-based SLAMs
Although SLAM approaches reconstruct depth and pose from online (streaming) videos, which differs from our problem setting of video SfM for offline videos, we compare with the state-of-the-art supervised SLAM methods Teed and Deng (2019, 2021), which tackle global consistency. DeepV2D Teed and Deng (2019) performs global pose optimization with a tracking window of eight frames. DROID-SLAM Teed and Deng (2021) utilizes dense and full bundle adjustment to achieve global consistency. Table 5 presents the quantitative comparison on the 7-Scenes Shotton et al. (2013) and TUM RGBD Sturm et al. (2012) datasets. Though DROID-SLAM achieves a superior Absolute Trajectory Error (ATE) on TUM RGBD, it is weaker on the other pose metric, Relative Pose Error (RPE), which measures drift. Both DeepV2D and DROID-SLAM are supervised methods trained on other datasets with groundtruth depth or pose. They generalize poorly under domain discrepancy and thus produce weak results on the 7-Scenes dataset. In contrast, our test-time training approach learns directly on the input test video to address the generalization issue and achieves the best scores on 7-Scenes.
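The difference between the two pose metrics can be sketched on camera positions alone. The version below is a simplified, position-only illustration: it aligns trajectories by their centroids, whereas standard evaluations align with a rigid or similarity transform and compute RPE on relative SE(3) poses. The function names and the toy trajectories are assumptions.

```python
import numpy as np

def ate_rmse(gt, est):
    """RMSE of position error after removing the mean offset between trajectories."""
    gt, est = np.asarray(gt, dtype=float), np.asarray(est, dtype=float)
    err = (est - est.mean(axis=0)) - (gt - gt.mean(axis=0))
    return float(np.sqrt((err ** 2).sum(axis=1).mean()))

def rpe_trans_rmse(gt, est, delta=1):
    """RMSE of the error in relative displacement over a step of `delta` frames."""
    gt, est = np.asarray(gt, dtype=float), np.asarray(est, dtype=float)
    err = (est[delta:] - est[:-delta]) - (gt[delta:] - gt[:-delta])
    return float(np.sqrt((err ** 2).sum(axis=1).mean()))

# Groundtruth: five positions along the x axis, one unit apart.
gt = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
# Estimate A: every step is 10% too long, so the error grows steadily.
est_scaled = gt * np.array([1.1, 1.0])
# Estimate B: correct on average but jittering by +/-0.1 in y from frame to frame.
est_jitter = gt + np.array([[0.0, 0.1], [0.0, -0.1], [0.0, 0.1], [0.0, -0.1], [0.0, 0.1]])

ate_scaled, rpe_scaled = ate_rmse(gt, est_scaled), rpe_trans_rmse(gt, est_scaled)
ate_jitter, rpe_jitter = ate_rmse(gt, est_jitter), rpe_trans_rmse(gt, est_jitter)
```

In this toy case the jittering estimate has the lower ATE but the higher RPE, which shows that a method can rank well on one metric and poorly on the other.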
Comparison with ORB-SLAM2 Mur-Artal et al. (2015)
We also compare GCVD with the traditional state-of-the-art SLAM approach ORB-SLAM2 Mur-Artal et al. (2015), which performs loop closure to retain globally consistent poses and a 3D map. Table 6 shows the pose comparisons with the traditional COLMAP Schonberger and Frahm (2016) and ORB-SLAM2 Mur-Artal et al. (2015) on the TUM RGBD dataset Sturm et al. (2012). In general, ORB-SLAM2 gives the most accurate pose results with sparse hand-crafted features. Nonetheless, its sparse 3D reconstruction cannot provide the complete dense depth needed by various video processing applications. Moreover, COLMAP and ORB-SLAM2 suffer from the robustness issues of fragile hand-crafted features. As shown in Table 6, they fail at reconstruction or tracking on one and two sequences, respectively. On the other hand, the test-time training-based approach can overcome the robustness issue and provide dense depths with dense flow. Note that our GCVD shows promising pose results close to COLMAP's on average (excluding the failed sequences). Our GCVD thus provides a fundamental video SfM tool for dense depth reconstruction in video processing applications.
Evaluation on EuRoC Burri et al. (2016)
The challenging EuRoC dataset Burri et al. (2016) consists of 11 gray-scale sequences from a stereo camera mounted on a micro aerial vehicle in relatively large indoor environments. The groundtruth camera poses are captured by a laser tracker and a motion capture system. We present the absolute trajectory error (ATE) of each sequence in Table 7 and the qualitative comparison of depth estimation in Figure 8. Our method shows the lowest pose error on average. Besides, our GCVD and CVD2 maintain the depth prior of MiDaS, while Deep3D produces sub-optimal or collapsed depth.
Table 8: Per-frame runtime on 50-, 200-, 500-, 1000-, and 2000-frame videos, broken down into pre-process., decision, optim., pose graph optim., and optim.
We analyze the runtime of each stage of our algorithm in detail. The pipeline is divided into three main steps (pre-processing, main procedure, and post-processing) for a fair comparison with Deep3D Lee et al. (2021) and CVD2 Kopf et al. (2021). The averaged per-frame runtimes for videos of varying length are presented in Table 8. Note that Deep3D does not perform post-processing for video depth and pose estimation.
In the pre-processing step, our method takes less time since we only require adjacent optical flow. In contrast, Deep3D and CVD2 require multiple pairs of optical flow for a target frame (e.g., ) and thus consume more time.
In the main procedure for joint depth and pose optimization, Deep3D takes less time by reducing the iterations for the optimization of non-first fragments, but this may lead to sub-optimal results and global inconsistency. CVD2, on the other hand, takes an extremely long time due to its traditional CPU-based optimization. The per-frame runtime of our main procedure increases slightly with video length, mainly due to the geometric verification in keyframe association and the growing number of non-sequential edges in pose graph optimization. Nevertheless, we emphasize global consistency, especially in long videos. In the last post-processing step, we re-implement the post-processing proposed by CVD2 on the GPU to shorten the computation time.
In sum, the proposed method is efficient thanks to the keyframe-based pose graph and optimization. Our globally consistent method is slightly slower than Deep3D for videos longer than 500 frames, yet it is the fastest for videos shorter than 500 frames.
Our method can achieve globally consistent depth and pose estimation with efficient test-time training. Nevertheless, we discuss the following cases that may lead to poor performance.
Restriction by optical flow estimation. Although dense optical flow estimation is more robust than traditional sparse features, the depth and pose performance relies substantially on accurate optical flow. Yet state-of-the-art optical flow estimation can still suffer from generalization issues and thus produce unsatisfactory flow in some cases. Furthermore, the forward-backward consistency check is helpful but cannot fully guarantee the accuracy of the optical flow. Hence, further improving dense optical flow estimation remains a promising future direction.
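A forward-backward consistency check of the kind mentioned above can be sketched as follows. This is a generic version that samples the backward flow at the nearest pixel and uses a fixed threshold `tol`; both choices, and the function name `forward_backward_mask`, are assumptions rather than the exact criterion used in our pipeline.

```python
import numpy as np

def forward_backward_mask(flow_fw, flow_bw, tol=1.0):
    """Mark pixels whose forward flow is undone by the backward flow.

    flow_fw, flow_bw: (H, W, 2) arrays holding (dx, dy) per pixel.
    Returns a boolean (H, W) mask; True means the round trip error is <= tol.
    """
    h, w, _ = flow_fw.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Location each pixel lands on in the other frame (nearest-pixel sampling).
    xt = np.rint(xs + flow_fw[..., 0]).astype(int)
    yt = np.rint(ys + flow_fw[..., 1]).astype(int)
    inside = (xt >= 0) & (xt < w) & (yt >= 0) & (yt < h)
    xt_c, yt_c = np.clip(xt, 0, w - 1), np.clip(yt, 0, h - 1)
    # A consistent pair satisfies flow_fw(p) + flow_bw(p + flow_fw(p)) ~ 0.
    round_trip = flow_fw + flow_bw[yt_c, xt_c]
    return inside & (np.linalg.norm(round_trip, axis=-1) <= tol)

# Toy flows: everything moves one pixel to the right, and back again.
flow_fw = np.zeros((4, 6, 2))
flow_fw[..., 0] = 1.0
flow_bw = np.zeros((4, 6, 2))
flow_bw[..., 0] = -1.0
flow_bw[2, 3] = (3.0, 0.0)          # one corrupted backward vector

mask = forward_backward_mask(flow_fw, flow_bw)
```

Pixels that leave the frame or hit the corrupted vector are rejected, but a pair of flows that are wrong in a mutually consistent way would still pass, which is why the check cannot guarantee accuracy.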
Learnable camera intrinsic parameters. In this work, we assume ideal camera intrinsics with a fixed focal length to simplify the learning of scale consistency in depth and camera pose. Handling videos with varying focal lengths while keeping global consistency remains challenging. Thus, we leave the reconstruction of videos with varying focal length to future work.
- Ceres solver. Note: http://ceres-solver.org Cited by: Related Work.
- Depth-aware video frame interpolation. In CVPR, Cited by: Introduction.
- Efficient initial pose-graph generation for global sfm. In CVPR, Cited by: Pre-processing and Pose Graph Generation.
- Surf: speeded up robust features. In ECCV, Cited by: Related Work.
- Unsupervised scale-consistent depth and ego-motion learning from monocular video. NeurIPS. Cited by: Introduction, Related Work, Joint Optimization of Depth and Pose.
- CodeSLAM—learning a compact, optimisable representation for dense visual slam. In CVPR, Cited by: Related Work.
- The euroc micro aerial vehicle datasets. The International Journal of Robotics Research (IJRR). Cited by: Figure 8, Appendix A, Appendix A, Table 7, Appendix A, Experiments.
- Depth prediction without the sensors: leveraging structure for unsupervised learning from monocular videos. In AAAI, Cited by: Related Work.
- Single-image depth perception in the wild. NeurIPS. Cited by: Related Work.
- Oasis: a large-scale dataset for single image 3d in the wild. In CVPR, Cited by: Related Work.
- Self-supervised learning with geometric constraints in monocular video: connecting flow, depth, and camera. In ICCV, Cited by: Related Work.
- Extreme view synthesis. In CVPR, Cited by: Introduction.
- Deepfactors: real-time probabilistic dense monocular slam. IEEE Robotics and Automation Letters. Cited by: Related Work.
- DepthLab: Real-Time 3D Interaction With Depth Maps for Mobile Augmented Reality. In Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, UIST. Cited by: Introduction.
- Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, Cited by: Related Work.
- Depth map prediction from a single image using a multi-scale deep network. In NeurIPS, Cited by: Introduction, Related Work, Experiments.
- LSD-slam: large-scale direct monocular slam. In ECCV, Cited by: Introduction, Related Work, Pre-processing and Pose Graph Generation.
- Cnn-slam: real-time dense monocular slam with learned depth prediction. In CVPR, Cited by: Introduction, Related Work.
- Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM. Cited by: Pre-processing and Pose Graph Generation.
- Multi-view stereo: a tutorial. Found. Trends. Comput. Graph. Vis.. Cited by: Related Work.
- Unsupervised cnn for single view depth estimation: geometry to the rescue. In ECCV, Cited by: Related Work.
- Vision meets robotics: the kitti dataset. International Journal of Robotics Research (IJRR). Cited by: Related Work.
- Unsupervised monocular depth estimation with left-right consistency. In CVPR, Cited by: Related Work.
- Digging into self-supervised monocular depth estimation. In ICCV, Cited by: Introduction, Related Work, Figure 3, Pre-processing and Pose Graph Generation, Joint Optimization of Depth and Pose, Experiments.
- PLADE-net: towards pixel-level accuracy for self-supervised single-view depth estimation with neural positional encoding and distilled matting loss. In CVPR, Cited by: Related Work.
- Depth from videos in the wild: unsupervised monocular depth learning from unknown cameras. In ICCV, Cited by: Related Work.
- Multiple view geometry in computer vision. Cambridge university press. Cited by: Introduction.
- Deep residual learning for image recognition. In CVPR, Cited by: Joint Optimization of Depth and Pose.
- Fast depth densification for occlusion-aware augmented reality. 37 (6). Cited by: Introduction.
- Deepmvs: learning multi-view stereopsis. In CVPR, Cited by: Related Work.
- Dpsnet: end-to-end deep plane sweep stereo. arXiv preprint arXiv:1905.00538. Cited by: Related Work, Table 1, Evaluation on 7-Scenes Shotton et al. (2013).
- EffiScene: efficient per-pixel rigidity inference for unsupervised joint learning of optical flow, depth, camera pose and motion segmentation. In CVPR, Cited by: Related Work.
- Fine-grained semantics-aware representation enhancement for self-supervised monocular depth estimation. In ICCV, Cited by: Related Work.
- Parallel tracking and mapping for small ar workspaces. In 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, Cited by: Related Work, Pre-processing and Pose Graph Generation.
- Robust consistent video depth estimation. In CVPR, Cited by: Figure 8, Appendix A, Appendix A, Appendix A, Figure 1, Introduction, Related Work, Figure 3, Figure 5, Joint Optimization of Depth and Pose, Joint Optimization of Depth and Pose, Implementation Detail, Table 1, Methodology, Figure 6, Evaluation on 7-Scenes Shotton et al. (2013), Ablation studies, Table 2, Experiments.
- G2o: a general framework for graph optimization. In ICRA, Cited by: Appendix A, Related Work, Related Work, Keyframe-based 3D Optimization, Implementation Detail.
- 3D video stabilization with depth estimation by cnn-based optimization. In CVPR, Cited by: Figure 8, Appendix A, Figure 1, Introduction, Introduction, Related Work, Figure 5, Keyframe-based 3D Optimization, Table 1, Methodology, Figure 6, Evaluation on 7-Scenes Shotton et al. (2013), Table 2, Experiments.
- Sequential adversarial learning for self-supervised deep visual odometry. In ICCV, Cited by: Related Work.
- Learning the depths of moving people by watching frozen people. In CVPR, Cited by: Related Work.
- Megadepth: learning single-view depth prediction from internet photos. In CVPR, Cited by: Related Work.
- Infinite nature: perpetual view generation of natural scenes from a single image. In ICCV, Cited by: Introduction.
- Learning depth from single monocular images using deep convolutional neural fields. IEEE TPAMI. Cited by: Related Work.
- Content-preserving warps for 3d video stabilization. ACM TOG. Cited by: Introduction.
- Multi-view depth estimation using epipolar spatio-temporal networks. In CVPR, Cited by: Related Work.
- Occlusion-aware depth estimation with adaptive normal constraints. In ECCV, Cited by: Table 1, Evaluation on 7-Scenes Shotton et al. (2013).
- Distinctive image features from scale-invariant keypoints. IJCV. Cited by: Related Work, Pre-processing and Pose Graph Generation.
- Consistent video depth estimation. ACM TOG. Cited by: Introduction, Related Work.
- A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, Cited by: Related Work.
- Boosting monocular depth estimation models to high-resolution via content-adaptive multi-resolution merging. In CVPR, Cited by: Introduction, Related Work.
- Openmvg: open multiple view geometry. In International Workshop on Reproducible Research in Pattern Recognition, Cited by: Related Work.
- ORB-slam: a versatile and accurate monocular slam system. IEEE Transactions on Robotics. Cited by: Appendix A, Appendix A, Table 6, Introduction, Related Work, Pre-processing and Pose Graph Generation.
- Indoor segmentation and support inference from rgbd images. In ECCV, Cited by: Related Work.
- Towards robust monocular depth estimation: mixing datasets for zero-shot cross-dataset transfer. IEEE TPAMI. Cited by: Introduction, Introduction, Related Work, Figure 3, Figure 4, Pre-processing and Pose Graph Generation, Joint Optimization of Depth and Pose, Ablation studies, Experiments.
- Competitive collaboration: joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In CVPR, Cited by: Related Work.
- ORB: an efficient alternative to sift or surf. In ICCV, Cited by: Related Work.
- Make3d: learning 3d scene structure from a single still image. IEEE TPAMI. Cited by: Related Work.
- Structure-from-motion revisited. In CVPR, Cited by: Appendix A, Table 6, Introduction, Related Work, Pre-processing and Pose Graph Generation, Table 2.
- Scene coordinate regression forests for camera relocalization in rgb-d images. In CVPR, Cited by: Appendix A, Table 5, Figure 1, Introduction, Introduction, Related Work, Figure 5, Table 1, Evaluation on 7-Scenes Shotton et al. (2013), Ablation studies, Table 3, Experiments.
- A benchmark for the evaluation of rgb-d slam systems. In IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), Cited by: Figure 7, Appendix A, Appendix A, Table 5, Table 6, Related Work, Evaluation on TUM-RGBD Sturm et al. (2012), Table 2, Experiments, Experiments.
- NeuralRecon: real-time coherent 3d reconstruction from monocular video. In CVPR, Cited by: Table 1, Evaluation on 7-Scenes Shotton et al. (2013).
- Droid-slam: deep visual slam for monocular, stereo, and rgb-d cameras. NeurIPS. Cited by: Appendix A, Table 5, Introduction, Related Work, Table 1, Evaluation on 7-Scenes Shotton et al. (2013), Experiments.
- DeepV2D: video to depth with differentiable structure from motion. In International Conference on Learning Representations, Cited by: Appendix A, Table 5, Related Work, Table 1, Evaluation on 7-Scenes Shotton et al. (2013), Experiments.
- Raft: recurrent all-pairs field transforms for optical flow. In ECCV, Cited by: Implementation Detail, Experiments.
- Bundle adjustment—a modern synthesis. In International workshop on vision algorithms, Cited by: Related Work.
- Demon: depth and motion network for learning monocular stereo. In CVPR, Cited by: Related Work.
- Attention is all you need. In NeurIPS, Cited by: Figure 3, Joint Optimization of Depth and Pose.
- Deep two-view structure-from-motion revisited. In CVPR, Cited by: Related Work.
- Image quality assessment: from error visibility to structural similarity. IEEE TIP. Cited by: Joint Optimization of Depth and Pose.
- The temporal opportunist: self-supervised multi-frame monocular depth. In CVPR, Cited by: Related Work.
- Deepsfm: structure from motion via deep bundle adjustment. In ECCV, Cited by: Related Work.
- MonoRec: semi-supervised dense reconstruction in dynamic environments from a single moving camera. In CVPR, Cited by: Related Work.
- VisualSFM: a visual structure from motion system. Cited by: Related Work.
- Detectron2. Note: https://github.com/facebookresearch/detectron2 Cited by: Pre-processing and Pose Graph Generation.
- Lessons and insights from creating a synthetic optical flow benchmark. In ECCV Workshop on Unsolved Problems in Optical Flow and Stereo Estimation, Cited by: Related Work, Experiments.
- Mvsnet: depth inference for unstructured multi-view stereo. In ECCV, Cited by: Related Work.
- Learning to recover 3d scene shape from a single image. In CVPR, Cited by: Related Work.
- Geonet: unsupervised learning of dense depth, optical flow and camera pose. In CVPR, Cited by: Introduction, Related Work.
- Novel view synthesis of dynamic scenes with globally coherent depths from a monocular camera. In CVPR, Cited by: Introduction.
- A tutorial on quantitative trajectory evaluation for visual(-inertial) odometry. In IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), Cited by: Evaluation on 7-Scenes Shotton et al. (2013), Experiments.
- Deeptam: deep tracking and mapping. In ECCV, Cited by: Related Work.
- Unsupervised learning of depth and ego-motion from video. In CVPR, Cited by: Introduction, Related Work.
- R-msfm: recurrent multi-scale feature modulation for monocular depth estimating. In ICCV, Cited by: Related Work.
- Learning monocular visual odometry via self-supervised long-term modeling. In ECCV, Cited by: Related Work.