SIM-Sync: From Certifiably Optimal Synchronization over the 3D Similarity Group to Scene Reconstruction with Learned Depth

by   Xihang Yu, et al.

This paper presents SIM-Sync, a certifiably optimal algorithm that estimates camera trajectory and 3D scene structure directly from multiview image keypoints. SIM-Sync fills the gap between pose graph optimization and bundle adjustment; the former admits efficient global optimization but requires relative pose measurements and the latter directly consumes image keypoints but is difficult to optimize globally (due to camera projective geometry). The bridge to this gap is a pretrained depth prediction network. Given a graph with nodes representing monocular images taken at unknown camera poses and edges containing pairwise image keypoint correspondences, SIM-Sync first uses a pretrained depth prediction network to lift the 2D keypoints into 3D scaled point clouds, where the scaling of the per-image point cloud is unknown due to the scale ambiguity in monocular depth prediction. SIM-Sync then seeks to synchronize jointly the unknown camera poses and scaling factors (i.e., over the 3D similarity group). The SIM-Sync formulation, despite nonconvex, allows designing an efficient certifiably optimal solver that is almost identical to the SE-Sync algorithm. We demonstrate the tightness, robustness, and practical usefulness of SIM-Sync in both simulated and real experiments. In simulation, we show (i) SIM-Sync compares favorably with SE-Sync in scale-free synchronization, and (ii) SIM-Sync can be used together with robust estimators to tolerate a high amount of outliers. In real experiments, we show (a) SIM-Sync achieves similar performance as Ceres on bundle adjustment datasets, and (b) SIM-Sync performs on par with ORB-SLAM3 on the TUM dataset with zero-shot depth prediction.


page 4

page 15

page 17


Learning to Recover 3D Scene Shape from a Single Image

Despite significant progress in monocular depth estimation in the wild, ...

Relative Pose from Deep Learned Depth and a Single Affine Correspondence

We propose a new approach for combining deep-learned non-metric monocula...

FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models

3D scene reconstruction is a long-standing vision task. Existing approac...

CP+: Camera Poses Augmentation with Large-scale LiDAR Maps

Large-scale colored point clouds have many advantages in navigation or s...

Learning Iterative Robust Transformation Synchronization

Transformation Synchronization is the problem of recovering absolute tra...

LU-NeRF: Scene and Pose Estimation by Synchronizing Local Unposed NeRFs

A critical obstacle preventing NeRF models from being deployed broadly i...

Targeted Adversarial Perturbations for Monocular Depth Prediction

We study the effect of adversarial perturbations on the task of monocula...

Please sign up or login with your details

Forgot password? Click here to reset