
DRO: Deep Recurrent Optimizer for Structure-from-Motion

by   Xiaodong Gu, et al.

There is increasing interest in studying the structure-from-motion (SfM) problem with machine learning techniques. While earlier methods directly learn a mapping from images to depth maps and camera poses, more recent works enforce multi-view geometry through optimization embedded in the learning framework. This paper presents a novel optimization method based on recurrent neural networks to further exploit the potential of neural networks in SfM. Our neural optimizer alternately updates the depth and camera poses through iterations to minimize a feature-metric cost. Two gated recurrent units are designed to trace the historical information during the iterations. Our network works as a zeroth-order optimizer, which avoids the computation- and memory-expensive cost volume and gradients. Experiments demonstrate that our recurrent optimizer effectively reduces the feature-metric cost while refining the depth and poses. Our method outperforms previous methods and is more efficient in computation and memory consumption than cost-volume-based methods. The code of our method will be made public.



1 Introduction

Structure-from-motion (SfM) [30] is a fundamental task in computer vision and essential for numerous applications such as robotics, autonomous driving, augmented reality, and 3D reconstruction. Given a sequence of images, SfM methods optimize depth maps and camera poses to recover the 3D structure of a scene. Traditional methods solve the Bundle-Adjustment (BA) problem [35], where the re-projection error between reprojected 3D scene points and 2D image feature points is minimized iteratively.

Recently, deep-learning-based methods have dominated most benchmarks and demonstrated advantages over traditional methods [36, 17, 32, 21, 40, 33]. Earlier learning-based methods [36, 14, 23, 26] directly regress the depth maps and camera poses from the input images, but ignore domain knowledge such as multi-view geometry. To combine the strengths of neural networks and traditional geometric methods, more recent works formulate the geometric-based optimization as differentiable layers and embed them in a learning framework [32, 33, 45].

We follow the approach of combining neural networks and optimization methods, with some novel insights. Firstly, previous methods [32, 12, 33] adopt gradient-based optimization such as the Levenberg-Marquardt or Gauss-Newton methods. However, the gradients can be noisy and misleading, especially for the high-dimensional optimization problem in dense depth map computation. Careful regularization such as the depth bases [32] or manifold embedding [6, 7] is often required. Furthermore, a multi-resolution strategy is needed to gradually compute the solution from coarse to fine. In comparison, we employ a gated recurrent neural network for optimization, inspired by [34], as illustrated in Figure 1. Our method does not compute gradients and works on the high-resolution image directly, without the regularization that might limit the generalization of the algorithm.

Secondly, some methods [40, 33, 48, 45] build cost volumes to solve for the dense depth maps. A similar cost volume is also employed in [34] to compute optical flow. A cost volume encodes the errors of multiple different depth values at each pixel. It evaluates the result quality within a large spatial neighborhood in the solution space in a discrete fashion. While cost volumes have been shown to be effective in computing depth maps [43, 20, 40], they are inefficient in time and space because they exhaustively evaluate results in a large spatial neighborhood. We argue that a gated recurrent network [11] can minimize the feature-metric error to compute dense depth without resorting to computing such a cost volume. In particular, the gated recurrent network only looks at the result quality at the current solution (i.e., a single point in the solution space) and at those of the previous iterations to update the results. In spirit, our learned optimizer is zeroth-order and exploits temporal information during iterations, while gradient-based and cost-volume-based methods rely only on spatial information. In this way, our method has the potential for better running time and memory efficiency.

Figure 2: The overview of our framework. The reference image and context image are first fed into feature networks sharing the same parameters, then the extracted features are mapped to the initial depth prediction and the initial pose prediction by the depth head and the pose head, respectively. Afterwards the optimizer begins to update the depth and pose iteratively. The current depth $\mathbf{D}$, current pose $\mathbf{T}$, and features of images are utilized to build the cost $\mathbf{C}$, then they are all fed into the optimizer to output the variable increments. By iteratively adding the increments, the estimated depth and pose progressively converge to the optimum, $\mathbf{D}^*$ and $\mathbf{T}^*$.

In experiments, our method demonstrates better accuracy than previous methods on both indoor and outdoor data. Our method is good at dealing with small-size, thin, and distant objects. We also show that the recurrent optimizer reduces the feature-metric cost over iterations and produces gradually improved depth maps and camera poses.

Our contributions can be summarized as follows:

1) We propose a novel zeroth-order recurrent optimizer for joint depth and camera pose optimization where gradients or cost volumes are not involved for better memory and computation efficiency.

2) The depths and poses are updated alternately by the GRU modules to decouple their mutual influence, which makes the optimization effective.

3) Our optimizer outputs better results than previous methods in both supervised and self-supervised settings.

2 Related work

Supervised Deep SfM.

Deep neural networks can learn to solve the SfM problem directly from data [36, 48, 40]. With the ground-truth information, DeMoN [36] trains two network branches to regress structures and motions separately with an auxiliary flow prediction task to exploit feature correspondences. Some methods adopt a discrete sampling strategy to achieve high-quality depth maps [48, 33]. They generate depth hypotheses and utilize multiple images to construct a cost volume. Furthermore, the pose volume is also introduced in [40]. They take the feature maps to build two cost volumes and employ 3D convolutions to regularize.

There are also methods that directly regress scene depth from a single input image [14, 17, 26], which is an ill-posed problem. These methods rely heavily on the data fitting of the neural networks. Therefore, their network structures and feature volumes are usually bulky, and their performance is limited in unseen scenes.

Self-supervised Deep SfM.

Supervised methods, nevertheless, require collecting a large amount of training data with ground-truth depth and camera poses. Recently, many unsupervised works [49, 19, 28, 46, 27, 37, 42, 44, 29, 4, 21, 38] have been proposed to train a depth and pose estimation model from only monocular RGB images. They employ the predicted depths and poses to warp the neighbor frames to the reference frame, such that a photometric constraint is created to serve as a self-supervision signal. In this case, dynamic objects are a problem and generate errors in the photometric loss. To address this, semantic masks [24] and optical flow [50, 47, 5] have been proposed to exclude the influence of moving objects. Another challenge is the visibility problem between different frames. To deal with this, a minimum re-projection loss is designed in [19, 21] to handle occluded regions. Despite these efforts, there is still a gap between the self-supervised methods and the supervised methods.

Learning to Optimize.

Traditional computer vision methods usually formulate tasks as optimization problems according to first principles such as photo-consistency and multi-view geometry. Inspired by this, many recent works seek to combine the strengths of neural networks and traditional optimization-based methods. There are mainly two approaches in learning to optimize. One approach [3, 2, 32, 33] employs a network to predict the inputs or parameters of an optimizer, which is implemented as some layers in a large neural network for end-to-end training. In contrast, the other approach directly learns to update optimization variables from the data [1, 10, 16, 12, 34].

However, the first approach needs to explicitly formulate the solver and is limited to problems where the objective function can be easily defined [3, 2, 32, 33]. Furthermore, the methods in [12, 32] need to explicitly evaluate gradients of the objective function, which is hard in many problems. Besides, the methods in [33, 34] adopt cost volumes, which make the model heavy to apply.

In comparison, our method does not require gradient computation or cost volume aggregation. It only evaluates the result quality at a single point in the solution space at each step. In this sense, it is a zeroth-order optimizer embedded in a network. By accumulating temporal evidence from previous iterations, our GRU module learns to minimize the objective function. Unlike the method in [34], which still relies on a cost volume, our method is more efficient in computation and memory. Besides, the two updaters in our framework, one for depth and the other for pose, are applied alternately, which is inspired by the traditional bundle adjustment algorithm.

3 Deep Recurrent Optimizer

3.1 Overview

Given the reference image $\mathbf{I}_0$ and neighboring images $\mathbf{I}_i$, our method outputs the depth $\mathbf{D}$ of the reference image and the relative camera poses $\mathbf{T}_i$ for the images $\mathbf{I}_i$, as shown in Figure 2. Images are first fed into a shared feature extraction module to produce features $\mathbf{F}_i$ for each image, then a depth head and a pose head take these features in and output the initial depth map and relative poses. Finally, the initial depth map and relative poses are refined alternately by the depth and the pose GRU-optimizers, and converge to the final depth and poses.

Figure 3: Workflow of the optimizer. The depth and pose updates are separated in each stage, where 4 updates of the depth are followed by 4 updates of the pose. We adopt 3 stages in our framework. For each depth update, the current depth $\mathbf{D}^{t}$, the cost $\mathbf{C}^{t}$, and the contextual feature map are fed in, then the update $\Delta\mathbf{D}^{t}$ is predicted based on the inputs and historical information. Afterwards, the depth is updated by $\mathbf{D}^{t+1} = \mathbf{D}^{t} + \Delta\mathbf{D}^{t}$.
Figure 4 (panels: depth cost, pose cost, depth cost heat map, ground-truth depth): Visualization of the cost decreasing over the stages of the GRU optimizer. In the cost heat map, the cost of the aligned region, e.g., the desk, keeps falling. In the meantime, we display how the depth and the pose are refined gradually. The aligned image, obtained by warping the neighbor image onto the reference image using the estimated depth and pose, is also presented.

3.2 Feature Extraction and Cost Construction

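Similar to BA-Net [32], we construct a photometric cost in feature space as the energy function to minimize. This cost measures the distance between aligned feature maps. Given the depth map $\mathbf{D}$ of the reference image $\mathbf{I}_0$ and the relative camera pose $\mathbf{T}_i$ of $\mathbf{I}_i$ with respect to $\mathbf{I}_0$, the cost is defined at each pixel $\mathbf{x}$ in the reference image $\mathbf{I}_0$:

$$\mathbf{C}_i(\mathbf{x}) = \left\| \mathbf{F}_i\!\left(\pi\!\left(\mathbf{T}_i \circ \pi^{-1}(\mathbf{x}, \mathbf{D}(\mathbf{x}))\right)\right) - \mathbf{F}_0(\mathbf{x}) \right\|_2,$$

where $\|\cdot\|_2$ is the L2 norm, and $\pi$ is the projection function. Thus, $\pi^{-1}(\mathbf{x}, \mathbf{D}(\mathbf{x}))$ is the 3D point corresponding to the pixel location $\mathbf{x}$, and $\mathbf{T}_i$ transforms 3D points from the camera space of the image $\mathbf{I}_0$ to that of $\mathbf{I}_i$. Note that the feature-metric error in BA-Net [32] would further sum the cost over all pixels as $\sum_{\mathbf{x}} \mathbf{C}_i(\mathbf{x})$. However, in this work, we maintain a cost map $\mathbf{C}_i$ that has the same resolution as the feature map $\mathbf{F}_0$. In the rest of this paper, we refer to $\mathbf{C}_i$ as the cost map instead of the feature-metric error.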
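The cost definition above can be made concrete with a minimal, dependency-free sketch. It assumes a pinhole camera with intrinsics `(fx, fy, cx, cy)`, a pose given as a rotation matrix `R` and translation `t`, and nearest-neighbour feature lookup; the function names are hypothetical, and a real implementation would more likely use bilinear sampling on tensors.

```python
import math

def backproject(u, v, depth, fx, fy, cx, cy):
    """pi^{-1}: lift pixel (u, v) with its depth to a 3D point in the reference camera."""
    return [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]

def transform(R, t, p):
    """Apply the rigid motion T = (R, t) to the 3D point p."""
    return [sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3)]

def project(p, fx, fy, cx, cy):
    """pi: project a 3D point (with positive z) to pixel coordinates."""
    return fx * p[0] / p[2] + cx, fy * p[1] / p[2] + cy

def feature_metric_cost(F0, Fi, depth_map, R, t, intrinsics):
    """Cost map: L2 distance between F0(x) and Fi sampled at the warped location of x."""
    fx, fy, cx, cy = intrinsics
    h, w = len(F0), len(F0[0])
    cost = [[0.0] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            p = transform(R, t, backproject(u, v, depth_map[v][u], fx, fy, cx, cy))
            u2, v2 = project(p, fx, fy, cx, cy)
            ui, vi = int(round(u2)), int(round(v2))  # nearest neighbour (an assumption)
            if 0 <= ui < w and 0 <= vi < h:
                diff = [a - b for a, b in zip(Fi[vi][ui], F0[v][u])]
                cost[v][u] = math.sqrt(sum(d * d for d in diff))
            else:
                cost[v][u] = float("nan")  # warped outside the neighbour image
    return cost
```

The result keeps one cost value per reference pixel instead of summing over the image, which is the distinction the text draws against the BA-Net feature-metric error.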

Depth and pose cost.

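When there are multiple neighboring images, we average the cost values $\mathbf{C}_i(\mathbf{x})$ over the $N$ neighboring images to obtain the cost for the depth value at each pixel:

$$\mathbf{C}(\mathbf{x}) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{C}_i(\mathbf{x}).$$

For the pose cost, we directly use $\mathbf{C}_i$ on each image $\mathbf{I}_i$, because the pose $\mathbf{T}_i$ is only associated with $\mathbf{I}_i$ when the depth map is fixed in our alternating optimization.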

Feature extraction.

There are two feature extraction modules. One is denoted as base feature network for extracting the aforementioned feature maps , while the other one is denoted as contextual feature network for providing the initial hidden state and the contextual feature for the GRU optimizer. We use ResNet18 [22] as our backbone to extract features. The resolution of the feature maps is of the original input images. The feature of the reference image is used for depth branch, while the feature of the concatenated image pair is used for pose branch.

3.3 Iterative Optimization

We then minimize the cost map in an iterative manner. At each iteration, the optimizer outputs an update of the depth, $\Delta\mathbf{D}$, and an update of the pose, $\Delta\mathbf{T}$. Inspired by [34], we utilize a gated recurrent unit to compute these updates, since a GRU can memorize the state of previous steps during the optimization, and the gated activation makes the update easier to converge.

3.3.1 Initialization

The initial depth and pose come from two simple initial networks, which add a depth head and a pose head, respectively, on top of the base feature network. The depth head is composed of two convolutional layers, and the pose head has the same structure plus an additional average pooling layer. The hidden state is initialized by the contextual feature network, with the tanh function as the activation.

3.3.2 Recurrent Update

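We design two GRU modules, one for updating the depth and the other for updating the camera pose. Each GRU module receives the current cost map $\mathbf{C}^t$ and the current estimated variables $\mathbf{V}^t$ (depth map $\mathbf{D}^t$ or camera pose $\mathbf{T}^t$) and outputs an incremental update $\Delta\mathbf{V}^t$ to update the results as $\mathbf{V}^{t+1} = \mathbf{V}^t + \Delta\mathbf{V}^t$.

Specifically, we first project the variable $\mathbf{V}^t$ and the cost $\mathbf{C}^t$ into the feature space with two convolutional layers, and then concatenate the two projected features and the image contextual feature into $\mathbf{x}^t$. The structure inside each GRU unit is then as follows:

$$\begin{aligned}
\mathbf{z}^{t+1} &= \sigma\left(\mathrm{conv}([\mathbf{h}^{t}, \mathbf{x}^{t}], \mathbf{W}_z)\right),\\
\mathbf{r}^{t+1} &= \sigma\left(\mathrm{conv}([\mathbf{h}^{t}, \mathbf{x}^{t}], \mathbf{W}_r)\right),\\
\tilde{\mathbf{h}}^{t+1} &= \tanh\left(\mathrm{conv}([\mathbf{r}^{t+1} \odot \mathbf{h}^{t}, \mathbf{x}^{t}], \mathbf{W}_h)\right),\\
\mathbf{h}^{t+1} &= (1 - \mathbf{z}^{t+1}) \odot \mathbf{h}^{t} + \mathbf{z}^{t+1} \odot \tilde{\mathbf{h}}^{t+1},
\end{aligned}$$

where $\mathrm{conv}$ represents a separable convolution, $\odot$ is the element-wise multiplication, and $\sigma$ and $\tanh$ are the sigmoid and the tanh activation functions. Finally, the depth maps or the camera poses are predicted from the hidden state $\mathbf{h}^{t+1}$ by structures similar to the initial depth or camera pose head in Sec. 3.3.1.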

With this optimizer, starting from the initial point, the estimated depth and pose are iteratively refined as the optimization proceeds. Finally, they both converge to fixed points $\mathbf{D}^*$ and $\mathbf{T}^*$.
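To illustrate the gating behaviour described in this subsection, here is a minimal sketch of a single GRU step on scalars. It is the textbook GRU recurrence, not the convolutional module of the paper: the separable convolutions are replaced by scalar multiplications, and the weight names in `w` are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(h, x, w):
    """One GRU step on scalars; w maps hypothetical weight names to floats."""
    z = sigmoid(w["zh"] * h + w["zx"] * x)                # update gate
    r = sigmoid(w["rh"] * h + w["rx"] * x)                # reset gate
    h_tilde = math.tanh(w["hh"] * (r * h) + w["hx"] * x)  # candidate state
    return (1.0 - z) * h + z * h_tilde                    # gated blend of old and new state
```

Because the new state is a convex combination of the previous state and a tanh output, it stays bounded across iterations, which is one way to read the claim that gating makes the updates easier to converge.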

3.3.3 Alternative Optimization

After defining the structure of the GRU unit, we update the depth map and the camera transformation alternately over $m$ stages in total. As shown in Figure 3, at each stage, we first freeze the camera pose and update the depth map, which is repeated $n$ times. Then we freeze the depth map and switch to updating the camera pose, which is also repeated $n$ times. Empirically, this alternating optimization leads to more stable optimization and easier training. In our experiments, $m$ is set to 3 and $n$ is set to 4 unless otherwise specified.
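The alternating schedule can be sketched as a plain loop. This is only an illustration under simplifying assumptions: depth and pose are scalars, increments are added directly (a real pose lives on SE(3)), and `update_depth` and `update_pose` are hypothetical callbacks standing in for the two GRU modules.

```python
def alternating_optimization(depth, pose, update_depth, update_pose, stages=3, repeats=4):
    """Alternate depth and pose refinement; each callback returns an increment."""
    schedule = []
    for _ in range(stages):
        for _ in range(repeats):   # pose frozen, refine depth
            depth = depth + update_depth(depth, pose)
            schedule.append("depth")
        for _ in range(repeats):   # depth frozen, refine pose
            pose = pose + update_pose(depth, pose)
            schedule.append("pose")
    return depth, pose, schedule
```

With the default 3 stages of 4 repeats, each variable receives 12 updates in total, matching the 12 iterations mentioned in the ablation study.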

To gain more insights into the recurrent process and demonstrate the GRU unit behaves as a recurrent optimizer, we visualize how the feature-metric error decreases over the GRU iterations in Figure 4. This figure shows that both the depths and the poses are refined step-by-step to the optimum along with a decreasing cost. Eventually, the warped neighbor image is aligned seamlessly with the reference image, and the estimated depth is close to the ground truth. This indicates that our optimizer refines the outputs by learning to minimize the feature-metric error.

3.4 Training Loss

3.4.1 Supervised Case

When ground truth is available, we supervise the training by evaluating the depth and pose errors.

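Depth supervision computes the L1 distance between the predicted depth map $\mathbf{D}^s$ and the ground-truth depth map $\hat{\mathbf{D}}$ in each stage $s$:

$$\mathcal{L}_{depth} = \sum_{s=1}^{m} \gamma^{m-s} \left\| \mathbf{D}^s - \hat{\mathbf{D}} \right\|_1,$$

where $\gamma$ is a discounting factor.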
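A small sketch of a stage-discounted L1 depth loss follows. It assumes that stage $s$ out of $m$ is weighted by the discounting factor raised to the power $m - s$, so later stages count more, and that the L1 distance is averaged over pixels; both choices, the function name, and the example value of `gamma` are assumptions for illustration.

```python
def depth_loss(stage_depths, gt_depth, gamma):
    """Discounted sum of per-stage mean L1 errors; depth maps are flat lists."""
    m = len(stage_depths)
    total = 0.0
    for s, pred in enumerate(stage_depths, start=1):
        l1 = sum(abs(p - g) for p, g in zip(pred, gt_depth)) / len(gt_depth)
        total += gamma ** (m - s) * l1  # the final stage has weight 1
    return total

# An error in the first of two stages is discounted once.
early_error = depth_loss([[2.0, 2.0], [1.0, 1.0]], [1.0, 1.0], gamma=0.5)
```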

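Pose supervision is defined as follows according to the ground-truth depth $\hat{\mathbf{D}}$ and pose $\hat{\mathbf{T}}$:

$$\mathcal{L}_{pose} = \sum_{s=1}^{m} \sum_{\mathbf{x}} \gamma^{m-s} \left\| \pi\!\left(\mathbf{T}^s \circ \pi^{-1}(\mathbf{x}, \hat{\mathbf{D}}(\mathbf{x}))\right) - \pi\!\left(\hat{\mathbf{T}} \circ \pi^{-1}(\mathbf{x}, \hat{\mathbf{D}}(\mathbf{x}))\right) \right\|_1.$$

This loss computes the image projection of a pixel according to the estimated pose $\mathbf{T}^s$ and the true pose $\hat{\mathbf{T}}$ in each stage. The distance between these two projections is defined as the pose loss, which is measured in image coordinates and is insensitive to different scene scales. In the experiments, we find it more effective than directly comparing $\mathbf{T}^s$ and $\hat{\mathbf{T}}$.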
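The scale insensitivity of a projection-based pose loss can be checked with a short sketch. It assumes a pinhole camera, poses given as `(R, t)` pairs, and a mean L1 distance over pixels; the function names are hypothetical.

```python
def reproject(u, v, depth, R, t, fx, fy, cx, cy):
    """Back-project pixel (u, v), move it by (R, t), and project it again."""
    x, y, z = (u - cx) / fx * depth, (v - cy) / fy * depth, depth
    p = [R[r][0] * x + R[r][1] * y + R[r][2] * z + t[r] for r in range(3)]
    return fx * p[0] / p[2] + cx, fy * p[1] / p[2] + cy

def pose_projection_error(pixels, gt_depths, pose_est, pose_gt, intrinsics):
    """Mean L1 distance between projections under the estimated and the true pose."""
    fx, fy, cx, cy = intrinsics
    total = 0.0
    for (u, v), d in zip(pixels, gt_depths):
        ue, ve = reproject(u, v, d, *pose_est, fx, fy, cx, cy)
        ug, vg = reproject(u, v, d, *pose_gt, fx, fy, cx, cy)
        total += abs(ue - ug) + abs(ve - vg)
    return total / len(pixels)
```

Scaling the depth and the translation by the same factor leaves the projected pixel positions unchanged, so the error is the same for a scene twice as large, whereas a direct comparison of translations would double.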

Then the supervised loss is the sum of these two terms:

$$\mathcal{L}_{supervised} = \mathcal{L}_{depth} + \mathcal{L}_{pose}.$$

3.4.2 Self-supervised Case

When ground truth is not available, we borrow the loss defined in [18] for self-supervised training. Specifically, the supervision signal comes from geometric constraints and is composed of two terms, a photometric loss and a smoothness loss.

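Photometric loss measures the similarity of the reconstructed image $\mathbf{I}'$ to the reference image $\mathbf{I}_0$. Here, the reconstructed image is generated by warping the input image $\mathbf{I}_i$ according to the depth $\mathbf{D}$ and pose $\mathbf{T}_i$. This similarity is measured by the structural similarity (SSIM) [39] combined with an L1 loss as

$$\mathcal{L}_{photo} = \alpha\, \frac{1 - \mathrm{SSIM}(\mathbf{I}', \mathbf{I}_0)}{2} + (1 - \alpha) \left\| \mathbf{I}' - \mathbf{I}_0 \right\|_1,$$

where $\alpha$ is a weighting factor. To fuse multiple photometric losses, we also adopt the strategies defined in [18], which take the per-pixel minimum and mask stationary pixels.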
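The minimum fusion and stationary-pixel masking borrowed from [18] can be sketched as follows. For brevity, the per-pixel error here is a plain absolute difference rather than the SSIM and L1 combination, images are flat lists of intensities, and the function name is hypothetical.

```python
def fuse_photometric(ref, warped_list, source_list):
    """Per-pixel minimum over warped-source errors, skipping stationary pixels."""
    total, count = 0.0, 0
    for k, r in enumerate(ref):
        warped_err = min(abs(w[k] - r) for w in warped_list)    # best warped neighbour
        identity_err = min(abs(s[k] - r) for s in source_list)  # best unwarped neighbour
        if warped_err < identity_err:  # keep pixels that warping actually explains
            total += warped_err
            count += 1
    return total / count if count else 0.0
```

Taking the minimum lets each pixel be explained by the neighbour in which it is visible, and the mask drops pixels whose appearance does not change between frames, such as objects moving with the camera.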

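Smoothness loss encourages adjacent pixels to have similar depths, especially those with a similar color:

$$\mathcal{L}_{smooth} = \sum_{\mathbf{x}} \left| \partial_x \mathbf{D}(\mathbf{x}) \right| e^{-\left| \partial_x \mathbf{I}_0(\mathbf{x}) \right|} + \left| \partial_y \mathbf{D}(\mathbf{x}) \right| e^{-\left| \partial_y \mathbf{I}_0(\mathbf{x}) \right|}.$$

Then the self-supervised loss is defined as a weighted sum of these two terms:

$$\mathcal{L}_{self} = \mathcal{L}_{photo} + \lambda\, \mathcal{L}_{smooth},$$

where $\lambda$ enforces a weighted depth regularization.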
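As an illustration of the smoothness term, the sketch below implements a common edge-aware form in which depth gradients are down-weighted by the exponential of the negative image gradient. It assumes a single-channel image stored as a 2D list and averages over neighbouring pixel pairs; the exact normalization in the paper may differ.

```python
import math

def smoothness_loss(depth, image):
    """Edge-aware smoothness: depth changes cost less where the image also changes."""
    h, w = len(depth), len(depth[0])
    total, count = 0.0, 0
    for v in range(h):
        for u in range(w):
            if u + 1 < w:  # horizontal neighbour
                grad_d = abs(depth[v][u + 1] - depth[v][u])
                grad_i = abs(image[v][u + 1] - image[v][u])
                total += grad_d * math.exp(-grad_i)
                count += 1
            if v + 1 < h:  # vertical neighbour
                grad_d = abs(depth[v + 1][u] - depth[v][u])
                grad_i = abs(image[v + 1][u] - image[v][u])
                total += grad_d * math.exp(-grad_i)
                count += 1
    return total / count if count else 0.0
```

A depth step that coincides with an image edge is penalized far less than the same step inside a uniformly colored region, which is the behaviour the text describes.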

Figure 5: Qualitative results on the KITTI dataset (panels include the input image and the ground truth).
Method Input Supervised GT type Abs Rel Sq Rel RMSE RMSE
MonoDepth [19] MO Improved
PackNet-SfM [21] MO Improved
DRO (ours) Multi Improved
Kuznietsov et al. [25] One Improved
DORN [17] One Improved
PackNet-SfM [21] MO Improved
BANet [32] Multi Improved
DeepV2D (2-view) [33] Multi Improved
DRO (ours) Multi Improved
SfMLearner [49] MO Velodyne
CCNet [29] MO Velodyne
GLNet [9] MO Velodyne
MonoDepth [19] MO Velodyne
PackNet-SfM [21] MO Velodyne
DRO (ours) Multi Velodyne
PackNet-SfM [21] MO Velodyne
DRO (ours) Multi Velodyne
Table 1: Quantitative results on the KITTI dataset. Eigen split is used for evaluation and seven widely used metrics are reported. Results on two ground-truth types are displayed since different methods are evaluated using different types. ‘MO’ means multiple images are used in the training while one image is used for inference.
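For reference, the sketch below computes the depth metrics commonly reported with the Eigen split. It assumes that the seven metrics are Abs Rel, Sq Rel, RMSE, RMSE log, and the three threshold accuracies with thresholds 1.25, 1.25², and 1.25³; only the first four names survive in the table header above, so the threshold accuracies are an assumption. Inputs are flat lists of positive depths at valid pixels.

```python
import math

def depth_metrics(pred, gt):
    """Standard monocular-depth error and accuracy metrics over valid pixels."""
    n = len(gt)
    pairs = list(zip(pred, gt))
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n)
    ratios = [max(p / g, g / p) for p, g in pairs]
    d1, d2, d3 = (sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3))
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse,
            "rmse_log": rmse_log, "d1": d1, "d2": d2, "d3": d3}
```

Lower is better for the four error metrics, and higher is better for the threshold accuracies.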
Figure 6: Qualitative results on the ScanNet dataset.
Method Supervised Abs Rel Sq Rel RMSE RMSE SI Inv Rot (deg) Tr (deg) Tr (cm) Time (s)
Photometric BA [15]
DeMoN [36]
BANet [32]
DeepV2D (2-view) [33]
DRO (ours)
DRO (ours)
Table 2: Quantitative results on the ScanNet dataset. Five metrics of the depth and three metrics of the pose are reported.

4 Experiments

This section first presents the implementation details and then evaluates our supervised and self-supervised models on outdoor and indoor datasets, which are followed by the ablation studies.

4.1 Implementation Details

Our work is implemented in PyTorch and trained on Nvidia RTX 2080 Ti GPUs. The network is optimized end-to-end with Adam (

, ) and the learning rate decreases from to in epochs. For supervised training, we use the ground truth from the datasets to supervise the training with the losses described in section 3.4.1, where is set as . For self-supervised training, without any ground-truth information, geometric constraints are leveraged to provide the supervision as depicted in section 3.4.2, where is set as and is set as .

4.2 Datasets

KITTI dataset.

The KITTI dataset is a widely used benchmark for depth evaluation, where outdoor scenes are captured from a moving vehicle. We adopt the training/test split proposed by Eigen et al. [14], resulting in images for training and images for testing. There are two types of ground-truth depth. One is the original Velodyne Lidar points which are quite sparse. The other one is the improved annotated depth map, which uses five successive images to accumulate the Lidar points and stereo images to handle moving objects. For the improved depth type there are images for testing.

ScanNet dataset.

ScanNet [13] is a large indoor dataset consisting of RGB-D videos in distinct environments. The raw data is captured from a depth camera. The depth maps and camera poses are obtained from RGB-D 3D reconstruction. We use the training/testing split proposed by [32], where image sequences from 90 scenes are used for test. In each test sequence, there are 2 images for inference.

SUN3D, RGB-D SLAM, and Scenes11.

SUN3D [41] has some indoor scenes where the imperfect depths and poses are provided. RGB-D SLAM [31] provides high-quality camera poses obtained with an external motion-tracking system and noisy depth maps. Scenes11 [36] is a synthetic dataset generated from ShapeNet [8] with perfect depths and camera poses. For these datasets, we use the data processed and split by [36] for a fair comparison to previous methods, resulting in image sequences for training and image sequences for testing in SUN3D, image sequences for training and image sequences for testing in RGB-D SLAM, image sequences for training and image sequences for testing in Scenes11.

Method SUN3D RGB-D Scenes11
L1-inv Sc-inv L1-rel Rot Tran L1-inv Sc-inv L1-rel Rot Tran L1-inv Sc-inv L1-rel Rot Tran
DeMoN [36]
LS-Net [12]
BANet [32]
DeepSfM [40]
DRO (ours)
Table 3: Quantitative results on the SUN3D, RGB-D SLAM, and Scenes11 datasets.

4.3 Evaluation

Evaluation on KITTI.

For outdoor scenes, we present the results of our method and some previous methods on the KITTI dataset in Table 1. State-of-the-art single-frame depth estimation methods and deep SfM methods are listed. For a fair comparison, all SfM methods are evaluated under the two-view setting. From the results, our approach outperforms other methods by a large margin in both the supervised setting and the self-supervised setting. Also, the performance of our self-supervised model already surpasses most previous supervised methods. The qualitative results of these outdoor scenes are shown in Figure 5, from which we can see our approach estimates better depth for distant and small-size or thin objects, e.g., people, motorbike, and guidepost. Also, we predict sharper edges at object boundaries. Thin structures are usually recovered by fine updates in the last few iterations.

Evaluation on ScanNet.

For indoor scenes, we evaluate our method on the ScanNet dataset in Table 2. For a fair comparison, all methods are evaluated under the two-view setting since there are only 2 images in the testing split. The results of Photometric BA and DeMoN are cited from [32]. The results show that our model outperforms previous methods on both depth accuracy and pose accuracy. Our self-supervised model is already able to predict the results that are comparable to supervised methods, especially on the pose accuracy. Among previous methods, DeepV2D performs best but it requires pre-training a complex pose solver first. Also, the inference time of their method is much longer than ours. Even using five views their performance is still not comparable to ours, of which their depth error is . From the qualitative results shown in Figure 6, our model predicts the finer depth of the indoor objects and is robust in clutter.

Evaluation on SUN3D, RGB-D SLAM, and Scenes11.

We further evaluate our approach on the SUN3D, RGB-D SLAM, and Scenes11 datasets using the data pre-processed by [36] for a fair comparison. In the experiments, however, we find that some of the selected image sequences are not suitable for training an SfM model because there is not enough overlap between neighboring images, which especially affects the prediction of the depth maps. Still, we achieve decent performance that is comparable to previous methods, especially in pose accuracy, which is not seriously affected by the lack of overlap.

Setting Abs Rel Sq Rel RMSE R
w/o GRU
w/o Alter
w/o Cost
Cost volume
Infer iterations
Table 4: Ablation study on the KITTI dataset. The first six metrics of those used in Table 1 are reported here.

4.4 Ablation Study

To inspect the effect of our framework, we evaluate each module of our model on the KITTI dataset and present the results in Table 4.

GRU module.

The core module of our framework is the recurrent optimizer. To see how the recurrent module helps the optimization, we replace the GRU block with three convolutional layers. In the training, the depth error decreases to in the first few epochs but then the network diverges. We think this is because the gate mechanism not only avoids the gradient explosion but also regularizes the optimization and makes the convergence stable. Furthermore, historical information is leveraged to guide the updating to avoid diverging directions.

Alternant update.

It is important to update the depth and pose alternately to decouple their influence on the feature-metric error. To see how they influence each other, we train a model where the optimizer predicts the depth and pose simultaneously. The depth accuracy in this setting is almost as poor as that of the model without the GRU. This alternation is critical: for instance, when the depth is being updated, the training of the depth optimizer is confused if the pose can change at the same time.

Cost Volume.

One of the advantages of our method is that we do not need a heavy cost volume for optimization. Here, we also replace the feature-metric error with a cost volume as the minimization objective. The cost volume has a cascaded structure, i.e., in each stage the depth range of the volume is dynamically adjusted according to the last estimated depth. From the results shown in Table 4, the performance with this heavy cost volume is similar to that with the cost map, which suggests that exploiting information in the temporal domain can make up for the lack of neighborhood information in the spatial domain. We also test the performance of a model without the cost input. As expected, the depth estimation error is large, since the optimizer loses the objective to minimize.

Iteration times.

So far, we have used 12 iterations of recurrent optimization for all experiments. The number of iterations can also be varied during inference, so here we test different iteration numbers at inference time. Zero iterations means that we do not update the initial depth and pose at all. According to the results in Table 4, our optimizer already outputs decent results after 4 iterations and predicts accurate results after 8 iterations, after which the depths are further refined with more iterations. This demonstrates that our optimizer has learned to optimize the estimation step by step. A model trained with a fixed number of iterations can be applied with more iterations in real applications to obtain finer results.

5 Conclusion

In this work, we have proposed a zeroth-order deep recurrent optimizer for addressing the structure-from-motion problem. Two gated recurrent units have been introduced to serve as an optimizer that considers both the spatial neighbor information and the temporal historical trajectories, such that the heavy cost volume can be replaced by a lightweight cost, and gradients need not be explicitly calculated in this end-to-end optimizer. The experiments have demonstrated that our approach outperforms previous methods on both the outdoor and indoor datasets, in both the supervised and self-supervised settings, which suggests that leveraging information in the time domain can make up for the lack of information in the spatial domain for an optimization problem.


  • [1] J. Adler and O. Öktem (2017) Solving ill-posed inverse problems using iterative deep neural networks. Inverse Problems 33 (12), pp. 124007. Cited by: §2.
  • [2] A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond, and Z. Kolter (2019) Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, pp. 9558–9570. Cited by: §2, §2.
  • [3] B. Amos and J. Z. Kolter (2017) Optnet: differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, pp. 136–145. Cited by: §2, §2.
  • [4] J. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M. Cheng, and I. Reid (2019) Unsupervised scale-consistent depth and ego-motion learning from monocular video. In Thirty-third Conference on Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • [5] J. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M. Cheng, and I. Reid (2019) Unsupervised scale-consistent depth and ego-motion learning from monocular video. In Thirty-third Conference on Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • [6] M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger, and A. J. Davison (2018) CodeSLAM - learning a compact, optimisable representation for dense visual slam. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2560–2568. Cited by: §1.
  • [7] M. Bloesch, T. Laidlow, R. Clark, S. Leutenegger, and A. J. Davison (2019) Learning meshes for dense visual slam. In Proceedings of the International Conference on Computer Vision, pp. 5855–5864. Cited by: §1.
  • [8] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. (2015) Shapenet: an information-rich 3d model repository. arXiv preprint arXiv:1512.03012. Cited by: §4.2.
  • [9] Y. Chen, C. Schmid, and C. Sminchisescu (2019) Self-supervised learning with geometric constraints in monocular video: connecting flow, depth, and camera. In Proceedings of the International Conference on Computer Vision, pp. 7063–7072. Cited by: Table 1.
  • [10] Y. Chen, M. W. Hoffman, S. G. Colmenarejo, M. Denil, T. P. Lillicrap, M. Botvinick, and N. Freitas (2017) Learning to learn without gradient descent by gradient descent. In Proceedings of the International Conference on Machine Learning, pp. 748–756. Cited by: §2.
  • [11] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §1.
  • [12] R. Clark, M. Bloesch, J. Czarnowski, S. Leutenegger, and A. J. Davison (2018) Ls-net: learning to solve nonlinear least squares for monocular stereo. arXiv preprint arXiv:1809.02966. Cited by: §1, §2, §2, Table 3.
  • [13] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner (2017) ScanNet: richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.2.
  • [14] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pp. 2366–2374. Cited by: §1, §2, §4.2.
  • [15] J. Engel, T. Schöps, and D. Cremers (2014) LSD-slam: large-scale direct monocular slam. In Proceedings of the European Conference on Computer Vision, pp. 834–849. Cited by: Table 2.
  • [16] L. Fan, W. Huang, C. Gan, S. Ermon, B. Gong, and J. Huang (2018) End-to-end learning of motion representation for video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6016–6025. Cited by: §2.
  • [17] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao (2018) Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2002–2011. Cited by: §1, §2, Table 1.
  • [18] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow (2019) Digging into self-supervised monocular depth estimation. In Proceedings of the International Conference on Computer Vision, pp. 3828–3838. Cited by: §3.4.2, §3.4.2.
  • [19] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow (2019) Digging into self-supervised monocular depth estimation. In Proceedings of the International Conference on Computer Vision, pp. 3828–3838. Cited by: §2, Table 1.
  • [20] X. Gu, Z. Fan, S. Zhu, Z. Dai, F. Tan, and P. Tan (2020) Cascade cost volume for high-resolution multi-view stereo and stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2495–2504. Cited by: §1.
  • [21] V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon (2020) 3d packing for self-supervised monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2485–2494. Cited by: §1, §2, Table 1.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §3.2.
  • [23] D. Jayaraman and K. Grauman (2015) Learning image representations tied to ego-motion. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1413–1421. Cited by: §1.
  • [24] M. Klingner, J. Termöhlen, J. Mikolajczyk, and T. Fingscheidt (2020) Self-supervised monocular depth estimation: solving the dynamic object problem by semantic guidance. In Proceedings of the European Conference on Computer Vision, pp. 582–600. Cited by: §2.
  • [25] Y. Kuznietsov, J. Stückler, and B. Leibe (2017) Semi-supervised deep learning for monocular depth map prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2215–2223. Cited by: Table 1.
  • [26] J. H. Lee, M. Han, D. W. Ko, and I. H. Suh (2019) From big to small: multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326. Cited by: §1, §2.
  • [27] R. Li, S. Wang, Z. Long, and D. Gu (2018) Undeepvo: monocular visual odometry through unsupervised deep learning. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 7286–7291. Cited by: §2.
  • [28] R. Mahjourian, M. Wicke, and A. Angelova (2018) Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5667–5675. Cited by: §2.
  • [29] A. Ranjan, V. Jampani, L. Balles, K. Kim, D. Sun, J. Wulff, and M. J. Black (2019) Competitive collaboration: joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12240–12249. Cited by: §2, Table 1.
  • [30] J. L. Schonberger and J. Frahm (2016) Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4104–4113. Cited by: §1.
  • [31] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers (2012) A benchmark for the evaluation of rgb-d slam systems. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 573–580. Cited by: §4.2.
  • [32] C. Tang and P. Tan (2019) BA-net: dense bundle adjustment networks. In Proceedings of the International Conference on Learning Representations, Cited by: §1, §1, §2, §2, §3.2, Table 1, Table 2, §4.2, §4.3, Table 3.
  • [33] Z. Teed and J. Deng (2020) Deepv2d: video to depth with differentiable structure from motion. In Proceedings of the International Conference on Learning Representations, Cited by: §1, §1, §1, §2, §2, §2, Table 1, Table 2.
  • [34] Z. Teed and J. Deng (2020) Raft: recurrent all-pairs field transforms for optical flow. In Proceedings of the European Conference on Computer Vision, pp. 402–419. Cited by: §1, §1, §2, §2, §2, §3.3.
  • [35] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon (1999) Bundle adjustment—a modern synthesis. In International workshop on vision algorithms, pp. 298–372. Cited by: §1.
  • [36] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox (2017) Demon: depth and motion network for learning monocular stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5038–5047. Cited by: §1, §2, Table 2, §4.2, §4.3, Table 3.
  • [37] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey (2018) Learning depth from monocular videos using direct methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2022–2030. Cited by: §2.
  • [38] Y. Wang, P. Wang, Z. Yang, C. Luo, Y. Yang, and W. Xu (2019) Unos: unified unsupervised optical-flow and stereo-depth estimation by watching videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8071–8081. Cited by: §2.
  • [39] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. Cited by: §3.4.2.
  • [40] X. Wei, Y. Zhang, Z. Li, Y. Fu, and X. Xue (2020) DeepSFM: structure from motion via deep bundle adjustment. In Proceedings of the European Conference on Computer Vision, pp. 230–247. Cited by: §1, §1, §2, Table 3.
  • [41] J. Xiao, A. Owens, and A. Torralba (2013) Sun3d: a database of big spaces reconstructed using sfm and object labels. In Proceedings of the International Conference on Computer Vision, pp. 1625–1632. Cited by: §4.2.
  • [42] N. Yang, R. Wang, J. Stückler, and D. Cremers (2018) Deep virtual stereo odometry: leveraging deep depth prediction for monocular direct sparse odometry. In Proceedings of the European Conference on Computer Vision, Cited by: §2.
  • [43] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan (2018) Mvsnet: depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision, pp. 767–783. Cited by: §1.
  • [44] Z. Yin and J. Shi (2018) Geonet: unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1983–1992. Cited by: §2.
  • [45] Z. Yu and S. Gao (2020) Fast-mvsnet: sparse-to-dense multi-view stereo with learned propagation and gauss-newton refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1949–1958. Cited by: §1, §1.
  • [46] H. Zhan, R. Garg, C. S. Weerasekera, K. Li, H. Agarwal, and I. Reid (2018) Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 340–349. Cited by: §2.
  • [47] W. Zhao, S. Liu, Y. Shu, and Y. Liu (2020) Towards better generalization: joint depth-pose learning without posenet. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9151–9161. Cited by: §2.
  • [48] H. Zhou, B. Ummenhofer, and T. Brox (2018) Deeptam: deep tracking and mapping. In Proceedings of the European Conference on Computer Vision, pp. 822–838. Cited by: §1, §2.
  • [49] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1851–1858. Cited by: §2, Table 1.
  • [50] Y. Zou, Z. Luo, and J. Huang (2018) Df-net: unsupervised joint learning of depth and flow using cross-task consistency. In Proceedings of the European Conference on Computer Vision, pp. 36–53. Cited by: §2.