There is increasing interest in studying the structure-from-motion (SfM) problem with machine learning techniques. While earlier methods directly learn a mapping from images to depth maps and camera poses, more recent works enforce multi-view geometry through optimization embedded in the learning framework. This paper presents a novel optimization method based on recurrent neural networks to further exploit the potential of neural networks in SfM. Our neural optimizer alternately updates the depth and camera poses through iterations to minimize a feature-metric cost. Two gated recurrent units are designed to trace the historical information during the iterations. Our network works as a zeroth-order optimizer, avoiding the computation- and memory-expensive cost volumes and gradients. Experiments demonstrate that our recurrent optimizer effectively reduces the feature-metric cost while refining the depth and poses. Our method outperforms previous methods and is more efficient in computation and memory consumption than cost-volume-based methods. The code of our method will be made public.
Structure-from-motion (SfM) is a fundamental task in computer vision and essential for numerous applications such as robotics, autonomous driving, augmented reality, and 3D reconstruction. Given a sequence of images, SfM methods optimize depth maps and camera poses to recover the 3D structure of a scene. Traditional methods solve the Bundle-Adjustment (BA) problem, where the re-projection error between reprojected 3D scene points and 2D image feature points is minimized iteratively.
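As a minimal illustration of the re-projection error minimized in bundle adjustment, the sketch below (function name and the pinhole camera model are illustrative assumptions, not from the paper) computes the residual for a single 3D point:

```python
import numpy as np

def reprojection_error(K, R, t, X, x_obs):
    """Re-projection residual for one 3D point.

    K: 3x3 intrinsics; (R, t): world-to-camera pose; X: 3D point in
    world coordinates; x_obs: observed 2D feature location (pixels).
    """
    X_cam = R @ X + t                # world -> camera coordinates
    x_hom = K @ X_cam                # project onto the image plane
    x_proj = x_hom[:2] / x_hom[2]    # perspective division
    return x_proj - x_obs            # residual that BA minimizes
```

Bundle adjustment stacks these residuals over all points and views and minimizes their squared sum over depths (or 3D points) and camera poses.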
Recently, deep-learning-based methods have dominated most benchmarks and demonstrated advantages over traditional methods [36, 17, 32, 21, 40, 33]. Earlier learning-based methods [36, 14, 23, 26] directly regress the depth maps and camera poses from the input images, but domain knowledge such as multi-view geometry is ignored. To combine the strengths of neural networks and traditional geometric methods, more recent works formulate geometric optimization as differentiable layers and embed them in a learning framework [32, 33, 45].
We follow the approach of combining neural networks and optimization methods, with some novel insights. First, previous methods [32, 12, 33] adopt gradient-based optimization such as the Levenberg-Marquardt or Gauss-Newton methods. However, the gradients can be noisy and misleading, especially for the high-dimensional optimization problem of dense depth map computation. Careful regularization such as depth bases or manifold embedding [6, 7] is often required, and a multi-resolution strategy is needed to compute the solution gradually from coarse to fine. In comparison, we employ a gated recurrent neural network for optimization, as illustrated in Figure 1. Our method does not compute gradients and works on the high-resolution image directly, without regularization that might limit generalization.
Second, some methods [40, 33, 48, 45] build cost volumes to solve for dense depth maps; similar cost volumes are also employed in optical flow estimation. A cost volume encodes the errors of multiple candidate depth values at each pixel, evaluating result quality within a large spatial neighborhood of the solution space in a discrete fashion. While cost volumes have been demonstrated effective for computing depth maps [43, 20, 40], they are inefficient in time and space because they exhaustively evaluate results over a large neighborhood. We argue that a gated recurrent network can minimize the feature-metric error to compute dense depth without resorting to such a cost volume. In particular, the gated recurrent network only looks at the result quality at the current solution (i.e., a single point in the solution space) and at those of the previous iterations to update the results. In spirit, our learned optimizer is zeroth-order and exploits temporal information across iterations, while gradient-based or cost-volume-based methods rely only on spatial information. In this way, our method has the potential for better running time and memory efficiency.
In experiments, our method demonstrates better accuracy than previous methods in both indoor and outdoor data. Our method is good at dealing with small-size, thin, and distinct objects. We also show that the recurrent optimizer reduces the feature-metric cost over iterations and produces gradually improved depth maps and camera poses.
Our contributions can be summarized as follows:
1) We propose a novel zeroth-order recurrent optimizer for joint depth and camera pose optimization where gradients or cost volumes are not involved for better memory and computation efficiency.
2) The depths and poses are updated alternately by the GRU modules to uncouple their mutual influence for effective optimization.
3) Our optimizer outputs better results than previous methods in both supervised and self-supervised settings.
Deep neural networks can learn to solve the SfM problem directly from data [36, 48, 40]. With ground-truth supervision, DeMoN trains two network branches to regress structure and motion separately, with an auxiliary flow prediction task to exploit feature correspondences. Some methods adopt a discrete sampling strategy to achieve high-quality depth maps [48, 33]: they generate depth hypotheses and use multiple images to construct a cost volume. A pose volume has also been introduced, where feature maps are used to build two cost volumes that are regularized with 3D convolutions.
There are also methods that directly regress scene depth from a single input image [14, 17, 26], which is an ill-posed problem. These methods rely heavily on the data-fitting capacity of neural networks; consequently, their network structures and feature volumes are usually bulky, and their performance is limited in unseen scenes.
Supervised methods, nevertheless, require collecting a large number of training data with ground-truth depth and camera poses. Recently, many unsupervised works [49, 19, 28, 46, 27, 37, 42, 44, 29, 4, 21, 38]
have been proposed to train a depth and pose estimation model from only monocular RGB images. They employ the predicted depths and poses to warp neighboring frames to the reference frame, so that a photometric constraint serves as a self-supervision signal. In this case, dynamic objects are a problem and introduce errors into the photometric loss. To address this, semantic masks and optical flow [50, 47, 5] have been proposed to exclude the influence of moving objects. Another challenge is the visibility problem between different frames; to deal with it, a minimum re-projection loss is designed in [19, 21] to handle occluded regions. Despite these efforts, there is still a gap between self-supervised and supervised methods.
Traditional computer vision methods usually formulate tasks as optimization problems according to first principles such as photo-consistency and multi-view geometry. Inspired by this, many recent works seek to combine the strengths of neural networks and traditional optimization-based methods. There are mainly two approaches to learning to optimize. One approach [3, 2, 32, 33] employs a network to predict the inputs or parameters of an optimizer, which is implemented as layers inside a larger neural network for end-to-end training. The other approach directly learns to update the optimization variables from data [1, 10, 16, 12, 34].
However, the first approach needs to explicitly formulate the solver and is limited to problems where the objective function can be easily defined [3, 2, 32, 33]. Furthermore, the methods in [12, 32] need to explicitly evaluate gradients of the objective function, which is hard in many problems, and the methods in [33, 34] adopt cost volumes, which make the models heavy to apply.
In comparison, our method requires neither gradient computation nor cost volume aggregation: it only evaluates the result quality at a single point in the solution space at each step. In this sense, it is a zeroth-order optimizer embedded in a network. By accumulating temporal evidence from previous iterations, our GRU module learns to minimize the objective function. Unlike methods that still rely on a cost volume, ours is more computation- and memory-efficient. Besides, the two updaters in our framework, one for the depth and one for the pose, are applied alternately, which is inspired by the traditional bundle adjustment algorithm.
Given the reference image and neighboring images, our method outputs the depth of the reference image and the relative camera poses, as shown in Figure 2. Images are first fed into a shared feature extraction module to produce features for each image; then a depth head and a pose head take these features and output the initial depth map and relative poses. Finally, the initial depth map and relative poses are refined alternately by the depth and pose GRU-optimizers, converging to the final depth and poses.
Similar to BA-Net, we construct a photometric cost in feature space as the energy function to minimize. This cost measures the distance between aligned feature maps. Given the depth map of the reference image and the relative camera pose of a neighboring image with respect to it, the cost at each pixel of the reference image is the L2 norm of the difference between the reference feature and the feature warped from the neighboring view: each pixel is back-projected to a 3D point using its depth, transformed from the reference camera space to that of the neighboring image, and projected into the neighboring feature map. Note that the feature-metric error in BA-Net would further sum this cost over all pixels. In this work, however, we maintain a cost map that has the same resolution as the feature map. In the rest of this paper, we refer to it as a cost map rather than a feature-metric error.
When there are multiple neighboring images, we average their cost values at each pixel to obtain the cost for the depth value there. For the pose cost, we directly use the per-image cost, because when the depth map is fixed in our alternating optimization, the cost of each image depends only on its pose.
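The per-pixel feature-metric cost map described above can be sketched as follows. This is a simplified stand-in for the paper's implementation, with assumed names, nearest-neighbour sampling instead of bilinear interpolation, and a shared intrinsics matrix:

```python
import numpy as np

def cost_map(F_ref, F_nbrs, depth, poses, K):
    """Per-pixel feature-metric cost, averaged over neighbouring views.

    F_ref:  (H, W, C) reference feature map
    F_nbrs: list of (H, W, C) neighbour feature maps
    depth:  (H, W) depth of the reference view
    poses:  list of (R, t) transforms from reference camera to neighbour
    K:      3x3 shared camera intrinsics
    """
    H, W, _ = F_ref.shape
    Kinv = np.linalg.inv(K)
    us, vs = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1).astype(float)
    pts = (pix @ Kinv.T) * depth[..., None]        # back-project to 3D
    costs = np.zeros((H, W))
    for F_nbr, (R, t) in zip(F_nbrs, poses):
        proj = (pts @ R.T + t) @ K.T               # project into neighbour
        u = np.clip(np.round(proj[..., 0] / proj[..., 2]).astype(int), 0, W - 1)
        v = np.clip(np.round(proj[..., 1] / proj[..., 2]).astype(int), 0, H - 1)
        warped = F_nbr[v, u]                       # sample neighbour features
        costs += np.linalg.norm(warped - F_ref, axis=-1)  # per-pixel L2
    return costs / len(F_nbrs)                     # average over views
```

With correct depth and poses, the warped neighbour features align with the reference features and the cost map approaches zero.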
There are two feature extraction modules. One, the base feature network, extracts the aforementioned feature maps; the other, the contextual feature network, provides the initial hidden state and the contextual feature for the GRU optimizer. We use ResNet18 as the backbone to extract features, and the feature maps are at a reduced resolution relative to the original input images. The feature of the reference image is used for the depth branch, while the feature of the concatenated image pair is used for the pose branch.
We then minimize the cost map in an iterative manner. At each iteration, the optimizer outputs an update of the depth and of the pose. Inspired by recurrent optimization in prior work, we utilize a gated recurrent unit (GRU) to compute these updates, since a GRU can memorize the state of previous iterations during the optimization, and the gated activation makes the updates easier to converge.
The initial depth and pose come from two simple initial networks, obtained by adding a depth head and a pose head on top of the base feature network, respectively. The depth head is composed of two convolutional layers, and the pose head additionally appends an average pooling layer. The hidden state is initialized by the contextual feature network, with the tanh function as the activation.
We design two GRU modules, one for updating the depth and the other for updating the camera pose. Each GRU module receives the current cost map and the currently estimated variables (depth map or camera pose) and outputs an incremental update that is added to the current estimate.
Specifically, we first project the variable and the cost into the feature space with two convolutional layers, respectively, and then concatenate them with the image contextual feature to form the GRU input x_t. The structure inside each GRU unit is as follows:

z_t = σ(Conv_sep([h_{t−1}, x_t], W_z)),
r_t = σ(Conv_sep([h_{t−1}, x_t], W_r)),
h̃_t = tanh(Conv_sep([r_t ⊙ h_{t−1}, x_t], W_h)),
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t,

where Conv_sep represents a separable convolution, ⊙ is the element-wise multiplication, and σ and tanh are the sigmoid and tanh activation functions. Finally, the depth maps or the camera poses are predicted from the hidden state h_t by structures similar to the initial depth or camera pose head in Sec. 3.3.1.
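As a concrete sketch of this gated update, the following dense per-pixel GRU step implements the standard gate equations; the dense weight matrices stand in for the separable convolutions used in the paper, and the input x is assumed to be the encoded current estimate, cost, and context concatenated:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h, x, Wz, Wr, Wh):
    """One GRU update: h is the hidden state, x the input features."""
    hx = np.concatenate([h, x])
    z = sigmoid(Wz @ hx)                                 # update gate
    r = sigmoid(Wr @ hx)                                 # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h, x]))   # candidate state
    return (1.0 - z) * h + z * h_tilde                   # gated interpolation
```

The gated interpolation keeps the new state close to the old one unless the update gate opens, which is what stabilizes the iterative refinement.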
With this optimizer, starting from the initial point, the estimated depth and pose are iteratively refined as the optimization proceeds, and both eventually converge to fixed points.
After defining the structure of the GRU unit, we update the depth map and the camera transformation alternately over several stages. As shown in Figure 3, at each stage we first freeze the camera pose and update the depth map, repeating this several times; then we freeze the depth map and switch to updating the camera pose, which is likewise repeated several times. This alternating scheme empirically leads to more stable optimization and easier training. In our experiments, the number of stages is set to 3 and the number of repeats per stage to 4 unless otherwise specified.
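The alternating schedule can be sketched as below; `update_depth` and `update_pose` are placeholders for the two GRU optimizers, and the counts of 3 stages and 4 inner repeats follow the setting described above:

```python
def alternating_refinement(depth, pose, update_depth, update_pose,
                           n_stages=3, n_inner=4):
    """Alternate depth and pose refinement over n_stages stages.

    Within each stage, the depth is updated n_inner times with the pose
    frozen, then the pose is updated n_inner times with the depth frozen.
    update_depth / update_pose return additive increments.
    """
    for _ in range(n_stages):
        for _ in range(n_inner):                    # pose frozen
            depth = depth + update_depth(depth, pose)
        for _ in range(n_inner):                    # depth frozen
            pose = pose + update_pose(depth, pose)
    return depth, pose
```

Freezing one variable while updating the other decouples their influence on the shared cost, mirroring the alternation in classical bundle adjustment.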
To gain more insight into the recurrent process and demonstrate that the GRU unit behaves as a recurrent optimizer, we visualize how the feature-metric error decreases over the GRU iterations in Figure 4. The figure shows that both the depths and the poses are refined step by step toward the optimum along with a decreasing cost. Eventually, the warped neighboring image aligns seamlessly with the reference image, and the estimated depth is close to the ground truth. This indicates that our optimizer refines the outputs by learning to minimize the feature-metric error.
When ground truth is available, we supervise the training by evaluating the depth and pose errors.
The depth loss computes the L1 distance between the predicted depth map and the ground-truth depth map in each stage:

L_depth = Σ_{i=1}^{N} γ^{N−i} ‖D_i − D̄‖₁,

where γ is a discounting factor that down-weights earlier stages, D_i is the depth predicted at stage i of N, and D̄ is the ground-truth depth.
The pose loss is defined according to the ground-truth depth and pose: in each stage it computes the image projection of a pixel under the estimated pose and under the true pose. The distance between these two projections is the pose loss, which is measured in image coordinates and is therefore insensitive to scene scale. In the experiments, we find it more effective than directly comparing the estimated pose with the ground-truth pose.
The supervised loss is then the sum of these two terms.
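A minimal sketch of this supervised loss, assuming a discount factor of 0.8 and treating the pose term as a mean per-pixel projection distance (both assumptions, as the exact values are not given here):

```python
import numpy as np

def supervised_loss(depth_preds, depth_gt, proj_preds, proj_gt, gamma=0.8):
    """Discounted multi-stage supervised loss.

    depth_preds / proj_preds: per-stage predictions, latest stage last.
    proj_*: (..., 2) pixel projections under estimated / true poses.
    """
    n = len(depth_preds)
    loss = 0.0
    for i, (d, p) in enumerate(zip(depth_preds, proj_preds)):
        w = gamma ** (n - 1 - i)                  # later stages weighted more
        loss += w * np.abs(d - depth_gt).mean()   # depth L1 term
        loss += w * np.linalg.norm(p - proj_gt, axis=-1).mean()  # pose term
    return loss
```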
When ground truth is not available, we borrow the loss used in prior self-supervised work. Specifically, the supervision signal comes from geometric constraints and is composed of two terms, a photometric loss and a smoothness loss.
The photometric loss measures the similarity of the reconstructed images to the reference image. Here, the reconstructed images are generated by warping the input images according to the estimated depth and pose. The similarity is measured by the structural similarity (SSIM) combined with an L1 loss, where a weighting factor balances the two terms. For the fusion of multiple photometric losses, we also adopt the strategies of taking the per-pixel minimum and masking stationary pixels.
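A simplified version of this photometric term can be sketched as follows, using global image statistics for SSIM (the usual formulation uses a local window) and an assumed weighting factor of 0.85:

```python
import numpy as np

def photometric_loss(recon, ref, alpha=0.85, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM + L1 photometric loss between a warped reconstruction
    and the reference image (global-statistics SSIM)."""
    mu_x, mu_y = recon.mean(), ref.mean()
    var_x, var_y = recon.var(), ref.var()
    cov = ((recon - mu_x) * (ref - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    l1 = np.abs(recon - ref).mean()
    return alpha * (1.0 - ssim) / 2.0 + (1.0 - alpha) * l1
```

A perfect reconstruction yields SSIM of 1 and zero L1 error, so the loss vanishes exactly when the warp aligns the images.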
The smoothness loss encourages adjacent pixels to have similar depths, especially those with similar colors. The self-supervision loss is then defined as a weighted sum of the photometric and smoothness terms, where a weight controls the strength of the depth regularization.
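The edge-aware smoothness term can be sketched as follows; this is a common formulation, and the exact weighting used in the paper may differ:

```python
import numpy as np

def smoothness_loss(depth, image):
    """Edge-aware smoothness: depth gradients are penalised less where
    the image has strong gradients (likely object boundaries).

    depth: (H, W); image: (H, W, C) with channel values in [0, 1].
    """
    dx_d = np.abs(depth[:, 1:] - depth[:, :-1])            # depth gradients
    dy_d = np.abs(depth[1:, :] - depth[:-1, :])
    dx_i = np.abs(image[:, 1:] - image[:, :-1]).mean(axis=-1)  # image gradients
    dy_i = np.abs(image[1:, :] - image[:-1, :]).mean(axis=-1)
    # exp(-|image gradient|) relaxes the penalty at colour edges
    return (dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean()
```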
[Table 1: depth evaluation on KITTI. Columns: Method, Input, Supervised, GT type, Abs Rel, Sq Rel, RMSE, RMSE log. Compared methods include Kuznietsov et al. (one view, supervised, improved GT) and DeepV2D (2-view, multi-view, supervised, improved GT); numeric values not recovered.]
[Table 2: depth and pose evaluation on ScanNet. Columns: Method, Supervised, Abs Rel, Sq Rel, RMSE, RMSE log, SI Inv, Rot (deg), Tr (deg), Tr (cm), Time (s). Compared methods include Photometric BA and DeepV2D (2-view), both supervised; numeric values not recovered.]
This section first presents the implementation details and then evaluates our supervised and self-supervised models on outdoor and indoor datasets, which are followed by the ablation studies.
Our work is implemented in PyTorch and trained on Nvidia RTX 2080 Ti GPUs. The network is optimized end-to-end with Adam, with the learning rate decayed over the course of training. For supervised training, we use the ground truth from the datasets with the losses described in Section 3.4.1. For self-supervised training, without any ground-truth information, geometric constraints provide the supervision as described in Section 3.4.2.
The KITTI dataset is a widely used benchmark for depth evaluation, where outdoor scenes are captured from a moving vehicle. We adopt the training/test split proposed by Eigen et al. , resulting in images for training and images for testing. There are two types of ground-truth depth. One is the original Velodyne Lidar points which are quite sparse. The other one is the improved annotated depth map, which uses five successive images to accumulate the Lidar points and stereo images to handle moving objects. For the improved depth type there are images for testing.
ScanNet is a large indoor dataset consisting of RGB-D videos of distinct environments. The raw data is captured with a depth camera, and the depth maps and camera poses are obtained from RGB-D 3D reconstruction. We follow a previously proposed training/testing split, where image sequences from 90 scenes are used for testing; each test sequence contains 2 images for inference.
SUN3D contains indoor scenes with imperfect depths and poses. RGB-D SLAM provides high-quality camera poses obtained with an external motion-tracking system, together with noisy depth maps. Scenes11 is a synthetic dataset generated from ShapeNet with perfect depths and camera poses. For these datasets, we use previously processed and split data for a fair comparison, which fixes the training and testing sequence splits for SUN3D, RGB-D SLAM, and Scenes11.
For outdoor scenes, we present the results of our method and previous methods on the KITTI dataset in Table 1, listing state-of-the-art single-frame depth estimation methods and deep SfM methods. For a fair comparison, all SfM methods are evaluated under the two-view setting. Our approach outperforms the other methods by a large margin in both the supervised and the self-supervised setting; indeed, our self-supervised model already surpasses most previous supervised methods. The qualitative results in Figure 5 show that our approach estimates better depth for distant, small, or thin objects, e.g., pedestrians, motorbikes, and guideposts, and predicts sharper edges at object boundaries. Thin structures are usually recovered by fine updates in the last few iterations.
For indoor scenes, we evaluate our method on the ScanNet dataset in Table 2. For a fair comparison, all methods are evaluated under the two-view setting, since there are only 2 images per sequence in the testing split. The results of Photometric BA and DeMoN are cited from prior work. Our model outperforms previous methods in both depth and pose accuracy, and our self-supervised model already predicts results comparable to supervised methods, especially in pose accuracy. Among previous methods, DeepV2D performs best, but it requires pre-training a complex pose solver, and its inference time is much longer than ours; even when using five views, its performance is still not comparable to ours. The qualitative results in Figure 6 show that our model predicts finer depth for indoor objects and is robust to clutter.
We further evaluate our approach on the SUN3D, RGB-D SLAM, and Scenes11 datasets using the same pre-processed data as prior work for a fair comparison. In the experiments, however, we find that some of the selected image sequences are not well suited for training an SfM model because there is not enough overlap between neighboring images, which particularly affects the prediction of depth maps. Still, we achieve decent performance comparable to previous methods, especially in pose accuracy, which is not seriously affected by the lack of overlap.
[Table: ablation settings. Columns: Setting, Abs Rel, Sq Rel, RMSE, and a truncated fourth metric; numeric values not recovered.]
To inspect the effect of our framework, we evaluate each module of our model on the KITTI dataset and present the results in Table 4.
The core module of our framework is the recurrent optimizer. To see how the recurrent module helps the optimization, we replace the GRU block with three convolutional layers. During training, the depth error decreases in the first few epochs, but then the network diverges. We believe this is because the gate mechanism not only avoids gradient explosion but also regularizes the optimization and stabilizes convergence; furthermore, historical information is leveraged to guide the updates away from diverging directions.
It is important to update the depth and pose alternately to decouple their influence on the feature-metric error. To see how they influence each other, we train a model in which the optimizer predicts the depth and pose simultaneously. The depth accuracy in this setting is almost as poor as that of the model without the GRU. The alternation is critical because, for instance, when the depth is being updated, the training of the depth optimizer would be confused if the pose could change at the same time.
One advantage of our method is that we do not need a heavy cost volume for optimization. Here, we replace the feature-metric error with a cost volume as the minimization objective. The cost volume has a cascaded structure, i.e., in each stage the depth range of the volume is dynamically adjusted according to the last estimated depth. From the results in Table 4, the performance of this heavy cost volume is similar to that of the cost map, which suggests that exploiting information in the temporal domain can make up for the lack of neighborhood information in the spatial domain. We also test a model without the cost input; as expected, its depth error is large, since the optimizer loses the objective to minimize.
So far, we have used 12 iterations of recurrent optimization in all experiments, but the number of iterations can be varied during inference. Zero iterations means we do not update the initial depth and pose at all. According to the results in Table 4, our optimizer already outputs decent results after 4 iterations and accurate results after 8, after which the depths are further refined with more iterations. This demonstrates that our optimizer has learned to optimize the estimation step by step: a model trained with a fixed number of iterations can be run with more iterations in real applications to obtain finer results.
In this work, we have proposed a zeroth-order deep recurrent optimizer for the structure-from-motion problem. Two gated recurrent units serve as an optimizer that considers both spatial neighborhood information and the temporal history of the iterations, so that the heavy cost volume can be replaced by a lightweight cost map and gradients need not be explicitly calculated in this end-to-end framework. The experiments demonstrate that our approach outperforms previous methods on both outdoor and indoor datasets, in both the supervised and self-supervised settings, suggesting that leveraging information in the temporal domain can compensate for the lack of information in the spatial domain in an optimization problem.
Self-supervised learning with geometric constraints in monocular video: connecting flow, depth, and camera. In Proceedings of the International Conference on Computer Vision, pp. 7063–7072.
Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 340–349.