1 Introduction
In video-to-depth estimation, we are interested in estimating the depth of a given frame in a video sequence. This task has many important applications, including autonomous navigation and video understanding in general.
3D reconstruction from video has traditionally been approached through the classical Structure from Motion (SfM) pipeline. SfM uses correspondence between nearby viewpoints to derive 3D structure, simultaneously building a 3D point cloud while estimating camera parameters [33]. This pipeline has shown impressive results on a number of tasks [1, 15] and is often paired with Multi-View Stereo to build a more complete 3D representation [8, 9]. However, this pipeline is limited: the final reconstruction is only as good as the correspondence, which is often inaccurate or difficult to obtain, resulting in structurally implausible artifacts and incomplete reconstructions.
An alternative to the traditional pipeline is deep learning. Given ground truth depth, a deep network can be trained to directly predict depth from either a single image [4, 3, 23] or multiple frames [49, 41] using generic architectures consisting of standard operations such as convolution and pooling. One advantage of deep networks is that they can use monocular cues such as texture gradients and shading, as evidenced by their strong performance on depth estimation from a single image [4, 3, 23]. On the other hand, standard deep networks have been found to have trouble utilizing interframe correspondence—simply stacking multiple frames fails to outperform single image depth [49, 41].
In this work we propose a new approach, DeepV2D, that takes the best of both worlds by incorporating algorithmic elements of classical SfM into a deep network. We design a set of differentiable geometric modules based on classical SfM operations. Using these modules along with standard convolutional layers, we compose a video-to-depth system that is fully differentiable and end-to-end trainable.
DeepV2D is advantageous over a generic deep network because, with multi-view geometry hard-coded into the system, the network only needs to search for the remaining pieces. This smaller search space reduces overfitting and improves generalization. Our approach is advantageous over classical SfM because it uses learned features for correspondence and context to guide reconstruction.
Our main contribution is an end-to-end differentiable architecture that fully incorporates multi-view geometry in SfM—we "differentiablize" known algorithms to recover camera motion and 3D structure from correspondence. With such algorithms embedded, the network only needs to search for correspondence. This is different from prior deep-learning based approaches [49, 42, 41, 48], which incorporate geometry to a substantially lesser extent by requiring the network to learn some known geometric operations, e.g. motion estimation from correspondence.
In particular, we introduce a novel differentiable operator, the transform layer, which uses geometry to map flow and depth into a camera pose update. In addition, we introduce a stereo network that estimates depth from a collection of frames and is, for the first time, fully differentiable with respect to all inputs, including camera pose.
Our work is closely related to DeepTAM [48], which estimates depth and motion from video. DeepTAM is the first work to completely replace the components of SLAM—mapping and tracking—with fully learned networks. It produces high quality depth maps and can estimate camera trajectories in challenging environments. But there are several key differences. First, we incorporate geometry to a greater extent in our motion module: we introduce the transform layer, which maps from image motion to camera motion directly, while DeepTAM requires a network to learn this mapping. Second, our work is end-to-end differentiable and trained jointly, whereas DeepTAM trains its modules in isolation and is not fully differentiable. Furthermore, we leverage learned features for stereo reconstruction by building a cost volume over learned features, while DeepTAM uses a fixed similarity metric.
We evaluate DeepV2D on three separate datasets. On NYU depth [35] we substantially improve depth predictions over our monocular baseline and classical SfM results. On KITTI we outperform all other stereo based approaches and monocular depth estimation networks not trained with additional data. We evaluate against DeepTAM on the SUN3D dataset and show improved results.
2 Related Work
Structure from Motion: Beginning with early systems designed for small image collections [24, 27], Structure from Motion (SfM) has improved dramatically in regard to robustness, accuracy, and scalability. Advances have come from improved features [25, 14], optimization techniques [36], and more scalable data structures and representations [34, 11], culminating in a number of robust systems capable of large-scale reconstruction tasks [34, 37, 45]. However, SfM is limited by the accuracy and availability of correspondence. In the presence of low-texture regions, occlusions, or lighting changes, SfM can produce noisy or missing reconstructions or fail to converge entirely.
Simultaneous Localization and Mapping (SLAM) jointly estimates camera motion and 3D structure from a video sequence [6, 28, 29, 30]. LSD-SLAM [6] is unique in that it relies on a featureless approach to 3D reconstruction, directly estimating depth maps and camera pose by minimizing photometric error.
We replace both the motion estimation and mapping components of SfM and SLAM with neural network modules. SfM and SLAM are limited by their use of hand-crafted features. Our trainable modules are able to learn features suited for the tasks of motion and depth estimation, while retaining the geometric principles of SfM. Unlike SfM and SLAM, we can learn priors over 3D structure from large RGB-D datasets. Because our network is fully differentiable, it can readily be used in conjunction with standard deep learning components. We avoid the need for hand-engineered iterative reweighting schemes to deal with occlusions or moving objects, and instead let the network decide which image regions are important.
Single Image Depth Estimation: There has been a lot of recent interest in estimating 3D properties such as depth and surface normals from a single image. Eigen et al. [4] first showed that deep convolutional networks could be trained directly on raw pixels to estimate depth from a single image. This network was able to use monocular features alone to recover depth. Later, deeper network architectures further improved performance [3, 23].
In our case, we are interested in estimating depth from a video sequence. Single-image depth networks can be readily applied to this task, but in doing so, they are not able to use motion to guide reconstruction. Our approach retains the advantages of single-image depth networks, while also being able to exploit the motion parallax present in videos.
Geometry and Deep Learning: Geometric principles have been a guiding force for many deep learning architectures. Convolutional networks have been particularly successful at stereo matching [14, 26, 21]. Kendall et al. [21] built a 3D cost volume over 2D feature maps by sampling from a range of disparities. Kar et al. introduced LSM [18] and showed that similar ideas could be applied to reconstruct objects from multiple viewpoints. DeepV2D retains the geometric principles of these works, but is able to reconstruct scenes from video without known camera pose. Furthermore, unlike LSM, which is limited to objects due to its choice of a Euclidean reconstruction grid, we parameterize reconstruction by camera frustum coordinates, enabling us to reconstruct challenging indoor and outdoor scenes.
Camera motion estimation with deep neural networks has generated a lot of recent interest. Kendall et al. [20, 19] focused on the problem of camera localization, while other work [49, 42, 44, 48, 41] aims to estimate camera motion between a pair of frames. These approaches all treat camera pose as a regression problem by training a network to output the parameters of the camera motion matrix. Most related to our work is DeepTAM [48], which is unique in that it estimates camera motion iteratively, where each new motion estimate is used to render the target frame onto the keyframe. This greatly improves tracking performance and generalization. However, like previous work on deep motion estimation, DeepTAM requires a neural network to predict camera motion. These networks must learn to map image motion into a camera motion update. We propose the transform layer, which translates motion estimation into a correspondence problem. Unlike prior work, our motion module only needs to learn optical flow.
Furthermore, unlike DeepTAM, whose components are not fully differentiable and are trained in isolation, our system is end-to-end differentiable since we construct the cost volume over learned features using differentiable bilinear sampling. This allows DeepV2D in its entirety to be jointly trained end-to-end. By using learned features for reconstruction, our stereo module can learn a robust feature representation along with contextual information to facilitate matching and reconstruction.
Several geometric optimization problems have recently been formulated as differentiable network modules. Wang et al. [43] proposed a differentiable network operator which estimates camera motion by minimizing photometric error. BANet [39] applies a differentiable implementation of the Levenberg-Marquardt (LM) algorithm to solve for camera pose and depth jointly by minimizing reprojection error in a learned feature space. These works require optimization to be performed by photometric alignment, which is often highly non-convex [5]. Our transform layer predicts the residual term directly, which results in a simplified optimization problem.
We additionally decompose reconstruction into stereo matching and motion estimation. While our final depth estimate is the product of stereo matching, BANet estimates depth as a weighted combination of basis depth maps produced by a single-image depth network.
3 Approach
DeepV2D predicts depth from a video sequence. We take a collection of frames, plus a given keyframe, and predict a dense depth map. While DeepV2D predicts the depth for just a single frame from a video, it can easily be extended to output the depth for any collection of frames.
We decompose depth estimation into two separate subproblems, which we solve using neural network modules. First, our Stereo Module performs stereo reconstruction from a collection of images given camera motion estimates. Since the Stereo Module requires camera motion as an input, we estimate camera motion using our Motion Module, which in turn takes the keyframe depth as input. In the forward pass, we alternate between the stereo and motion modules, as we show in Figure 1. In this work, we assume known camera intrinsics (i.e. the camera is calibrated).
3.1 Camera Geometry and View Synthesis
As a preliminary, we define some of the operations used within the stereo and motion modules. We represent camera motion using 3D rigid body transformations. A rigid body transformation $T \in SE(3)$ describes rotation and translation in 3D:
$$T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}, \qquad R \in SO(3), \; t \in \mathbb{R}^3 \tag{1}$$
Furthermore, a rigid body transform can act to transform a 3D point $\mathbf{X}$ (in homogeneous coordinates): $\mathbf{X}' = T\,\mathbf{X}$.
The camera operator $\pi$ projects a 3D point $\mathbf{X} = (X, Y, Z)$ to a pixel $\mathbf{x} = (x, y)$:
$$\pi(\mathbf{X}) = \left( f_x \frac{X}{Z} + c_x, \; f_y \frac{Y}{Z} + c_y \right) \tag{2}$$
where $(f_x, f_y, c_x, c_y)$ are the camera intrinsics. Likewise, given depth $d$ we can recover a 3D point in homogeneous coordinates using the backprojection operator $\pi^{-1}$:
$$\pi^{-1}(\mathbf{x}, d) = \left( \frac{(x - c_x)\, d}{f_x}, \; \frac{(y - c_y)\, d}{f_y}, \; d, \; 1 \right)^{\top} \tag{3}$$
With $\pi$ and $\pi^{-1}$ we can define the projective warping function $\omega$, which maps a point $\mathbf{x}$ with depth $d$ to a camera transformed by $T$:
$$\omega(\mathbf{x}, d, T) = \pi\!\left( T \, \pi^{-1}(\mathbf{x}, d) \right) \tag{4}$$
All the equations defined are fully differentiable with respect to all inputs and can readily be used in conjunction with standard network layers.
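To make these operations concrete, here is a minimal NumPy sketch of the projection, backprojection, and warping operators defined above (the pinhole model follows Equations 2-4; the array layouts and function names are our own assumptions, not the paper's implementation):

```python
import numpy as np

def project(X, K):
    """Pinhole projection pi: map 3D points (N, 3) to pixels (N, 2)."""
    fx, fy, cx, cy = K
    x = fx * X[:, 0] / X[:, 2] + cx
    y = fy * X[:, 1] / X[:, 2] + cy
    return np.stack([x, y], axis=-1)

def backproject(x, d, K):
    """Backprojection pi^-1: pixels (N, 2) with depths (N,) to 3D points (N, 3)."""
    fx, fy, cx, cy = K
    X = (x[:, 0] - cx) * d / fx
    Y = (x[:, 1] - cy) * d / fy
    return np.stack([X, Y, d], axis=-1)

def warp(x, d, T, K):
    """Projective warping omega(x, d, T) = pi(T @ pi^-1(x, d))."""
    X = backproject(x, d, K)                              # lift pixels to 3D
    Xh = np.concatenate([X, np.ones((len(X), 1))], axis=-1)
    Xt = (T @ Xh.T).T[:, :3]                              # apply rigid motion
    return project(Xt, K)
```

Every step here is a composition of differentiable array operations, which is what allows gradients to flow through the warp in a deep learning framework.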
As a final note, we can apply Equation 4 to render an entire image from a synthetic viewpoint, provided the camera transform matrix $T$ and depth map $d$. Letting $\mathbf{x}$ be a pixel in the rendered frame $\hat{I}$, we can compute its value by using Equation 4 to find its location in the reference frame $I$, and sampling from the projected coordinate:
$$\hat{I}(\mathbf{x}) = I\big\langle \, \omega(\mathbf{x}, d(\mathbf{x}), T) \, \big\rangle \tag{5}$$
Here $\langle \cdot \rangle$ denotes the sampling operation (note that the projected points are continuous values). We choose differentiable bilinear sampling, which was proposed in spatial transformer networks [17]. Differentiable bilinear sampling computes the value of a point by interpolating from its 4-pixel neighborhood, with weights determined by proximity. By choosing differentiable bilinear sampling, we can backpropagate the gradient through the entire rendering process with respect to all inputs (i.e. depth, pose, reference image).
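A minimal NumPy sketch of the bilinear sampling operation (border handling by clipping is our simplification; a full implementation would also mask out-of-bounds coordinates):

```python
import numpy as np

def bilinear_sample(img, coords):
    """Sample img (H, W, C) at continuous (x, y) coords (N, 2) by interpolating
    from each point's 4-pixel neighborhood, weighted by proximity."""
    H, W = img.shape[:2]
    x, y = coords[:, 0], coords[:, 1]
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)   # clip to stay in bounds
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    wx = x - x0                                       # fractional offsets
    wy = y - y0
    top = (1 - wx)[:, None] * img[y0, x0] + wx[:, None] * img[y0, x0 + 1]
    bot = (1 - wx)[:, None] * img[y0 + 1, x0] + wx[:, None] * img[y0 + 1, x0 + 1]
    return (1 - wy)[:, None] * top + wy[:, None] * bot
```

Because the output is a smooth function of the coordinates, gradients can flow back to depth and pose through the sampled locations.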
3.2 Stereo Module
Given a set of frames $\{I_1, \dots, I_N\}$ and their respective estimated poses $\{T_1, \dots, T_N\}$, our stereo module predicts a dense depth map for the keyframe image $I_0$. Each pose $T_i$ represents the transformation from the keyframe camera to the camera at frame $i$. Hence, a point $\mathbf{x}$ in the keyframe with depth $d$ can be mapped to its location in frame $i$ using the projective warping function defined in Equation 4: $\mathbf{x}_i = \omega(\mathbf{x}, d, T_i)$.
Two-View Reconstruction: We first consider the case of two-view stereo reconstruction between the image pair $(I_0, I_1)$ before showing how we can generalize to an arbitrary number of frames. Given the keyframe image $I_0$ and the reference image $I_1$, we start by feeding each image through a convolutional neural network to generate dense unary feature maps $f_0$ and $f_1$. We call this network the encoder; the weights of the encoder are shared across images. The purpose of the encoder is to learn a dense feature representation which provides context and facilitates stereo matching.

The stereo module constructs a cost volume from the generated feature maps. The cost volume is a stack of feature maps, each rendered from the viewpoint of the keyframe camera. We enumerate over a set of hypothesis depths $\{d_1, \dots, d_D\}$ which span the range observed in the dataset. For each depth $d_k$, we render the feature map $f_1$ from the keyframe camera using Equation 5, assuming a planar scene of depth $d_k$, generating the warped feature map $\tilde{f}_1^k$. We concatenate $f_0$ and $\tilde{f}_1^k$ along the channel axis to form the $k$-th entry in the cost volume. The final cost volume is formed by stacking the $D$ rendered viewpoints into a single 4D tensor. Hence, if $f_0$ has dimensions $H \times W \times F$, the dimension of the fully constructed cost volume will be $D \times H \times W \times 2F$.

The cost volume is a powerful representation for stereo reconstruction and converts depth estimation into a matching problem. We perform matching and refinement with a 3D convolutional neural network. Our network consists of an encoder composed of 3x3x3 convolutional layers which subsample the spatial resolution, and a decoder which upsamples the spatial resolution. The encoder and decoder are connected with skip connections by performing elementwise addition of the feature maps. The overall architecture of the 3D matching network is similar to the hourglass network [32], with 2D layers replaced by 3D convolutions.
Decoder: The output of the matching network is a volume of dimension $D \times H \times W$. The elements of the volume represent the likelihood of a surface. We first perform a softmax over the depth dimension to convert surface likelihoods into a probability distribution over depths. In other words, $P(k, \mathbf{x})$ represents the probability of pixel $\mathbf{x}$ having depth $d_k$.

We convert the probability volume into a single depth estimate using the differentiable argmax function [21]. A pointwise depth estimate is found by taking the expected depth—computed as the sum of each hypothesis depth multiplied by its corresponding probability:
$$\hat{d}(\mathbf{x}) = \sum_{k=1}^{D} d_k \, P(k, \mathbf{x}) \tag{6}$$
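The softmax-plus-expectation step above can be sketched in a few lines of NumPy (the function name is our own):

```python
import numpy as np

def soft_argmax_depth(logits, depths):
    """Differentiable argmax: softmax over the depth axis, then expected depth.
    logits: (D, H, W) matching scores; depths: (D,) hypothesis depths."""
    e = np.exp(logits - logits.max(axis=0, keepdims=True))  # stable softmax
    p = e / e.sum(axis=0, keepdims=True)                    # per-pixel depth distribution
    return np.tensordot(depths, p, axes=1)                  # (H, W) expected depth
```

Unlike a hard argmax, the expectation is differentiable with respect to the matching scores, so the depth estimate can be supervised directly.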
Multiview Reconstruction: Our reconstruction pipeline can easily be extended to more than one keyframe-image pair to improve performance. For each keyframe-image pair $(I_0, I_i)$, $i = 1, \dots, N$, we compute the cost volume $C_i$. Each volume is first processed by four 3x3x3 convolutional layers with shared weights before a global pooling step, as shown in Figure 2. The global pooling step aggregates information across viewpoints by averaging the feature maps of the $N$ volumes.
Our stereo network shares many similarities with classical multi-view stereo pipelines which employ a cost volume to reconstruct depth. However, our approach has two key advantages over classical techniques. First, we begin by processing the images through a 2D CNN to generate dense features: instead of using hand-crafted features, the 2D network can learn feature representations which are more robust and easier to match. Furthermore, our 3D matching network is able to learn a similarity metric between feature vectors while using contextual information to refine the reconstruction, whereas classical work relies on much simpler priors such as smoothness assumptions.
3.3 Motion Module
The input to our motion module is the keyframe image/depth pair $(I_0, d_0)$ and the video frames $\{I_1, \dots, I_N\}$. The motion module estimates the motion between each keyframe-image pair $(I_0, I_i)$ for $i = 1, \dots, N$, as we show in Figure 3. The final output of the motion module is the set of rigid body transformations $\{T_1, \dots, T_N\}$.
The motion module considers each keyframe-image pair independently, and operates in parallel over the pairs. For the remainder of this section, we describe the operation for the pair $(I_0, I_1)$, but keep in mind that it works the same way for each of the other pairs.
Initialization: To generate an initial motion estimate, we simply stack the frames and train a network to predict the transformation parameters directly. This is in line with previous work, and we refer to it as the pose regression network. The estimates produced by the pose regression network are coarse and far from accurate enough for stereo reconstruction; regardless, we found the pose regression network to be a good starting point to initialize our system.
Iterative Refinement: The initial motion estimate $T^{(0)}$ is used as the starting point for further refinement. Like the stereo module, we begin by extracting a learned feature representation from the keyframe image pair to produce the feature maps $f_0$ and $f_1$. Again, the weights of the feature extractor are shared across all images.
Given the keyframe depth estimate $d_0$ and the current motion estimate $T$, we can render $f_1$ from the estimated viewpoint of the keyframe camera using Equation 5 (applying the projective warping function $\omega$) to produce the warped feature map $\tilde{f}_1$.
The objective of the motion module is to find a transformation which aligns $f_0$ and $\tilde{f}_1$. Each iteration of the motion module takes the rendered feature map pair and produces a transformation update to the current motion estimate. As the motion estimate improves, the feature maps become more and more aligned, resulting in smaller incremental updates. Rather than committing to the initial motion estimate, we can test how well each motion estimate agrees with the inverse projection and propose updates to correct the error.
We use Lie algebra elements to parameterize camera motion updates. An $\mathfrak{se}(3)$ element $\boldsymbol{\xi} \in \mathbb{R}^6$ can be mapped to $SE(3)$ with the exponential map [38]; the group operator is matrix multiplication. Given the vector $\boldsymbol{\xi}$, we can perform a camera motion update on $T$:
$$T' = \exp(\hat{\boldsymbol{\xi}}) \, T \tag{7}$$
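The exponential-map update of Equation 7 can be sketched as follows. We compute the matrix exponential with a truncated power series, which is a simplification adequate for the small incremental updates produced during refinement (a production implementation would use the closed-form Rodrigues expressions):

```python
import numpy as np

def se3_hat(xi):
    """Map a 6-vector xi = (v, w) to the 4x4 twist matrix in se(3)."""
    v, w = xi[:3], xi[3:]
    W = np.array([[0.0, -w[2], w[1]],
                  [w[2], 0.0, -w[0]],
                  [-w[1], w[0], 0.0]])      # skew-symmetric rotation part
    A = np.zeros((4, 4))
    A[:3, :3] = W
    A[:3, 3] = v
    return A

def se3_exp(xi, n_terms=20):
    """Exponential map se(3) -> SE(3) via a truncated power series."""
    A = se3_hat(xi)
    T, term = np.eye(4), np.eye(4)
    for k in range(1, n_terms):
        term = term @ A / k                 # A^k / k!
        T = T + term
    return T

def update_pose(xi, T):
    """Camera motion update of Equation 7: T' = exp(xi^) @ T."""
    return se3_exp(xi) @ T
```

Parameterizing updates in the Lie algebra keeps the estimate on the $SE(3)$ manifold without any explicit re-orthogonalization of the rotation.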
Residual Motion Network: We can now describe in greater detail how the motion updates are computed. Starting with the keyframe depth $d_0$, the feature maps $(f_0, \tilde{f}_1)$, and the motion estimate $T$, our residual motion network generates a motion update $\boldsymbol{\xi}$. As shown in Figure 3, the residual motion network is applied iteratively.
In each iteration step, we estimate the optical flow between the rendered feature map $\tilde{f}_1$ and $f_0$, which we term the residual flow, denoted $\mathbf{r}$. To estimate flow, we concatenate the feature maps and use an encoder-decoder network with skip connections modeled after FlowNetS [2]. This network is not directly supervised on flow; instead, the flow is used as an intermediate representation to produce the motion update.
The residual flow tells us the 2D motion between the rendered feature maps. Our residual motion network produces an update to correct this motion. We propose the transform layer, which is differentiable and maps the 2D motion into a 3D rigid body update. More formally, the input to this layer is the residual flow $\mathbf{r}$, the keyframe depth $d_0$, and the camera parameters; the output is an $\mathfrak{se}(3)$ element $\boldsymbol{\xi}$ which we use to perform a motion update following Equation 7.
For a given keyframe point $\mathbf{x}_i$ with depth $d_i$, we can find its location in the reference frame. We define $\mathbf{x}_i'(T)$ to be the reprojected point under a camera transformation $T$: $\mathbf{x}_i'(T) = \omega(\mathbf{x}_i, d_i, T)$. We find a motion update $\boldsymbol{\xi}$ such that the displacement of the reprojected points matches the residual flow. We formalize this as the following objective function:
$$E(\boldsymbol{\xi}) = \sum_i \left\| \mathbf{x}_i'\big(\exp(\hat{\boldsymbol{\xi}})\,T\big) - \big( \mathbf{x}_i'(T) + \mathbf{r}_i \big) \right\|^2 \tag{8}$$
Since this is a sum of squares, an update can be proposed with a single Gauss-Newton iteration:
$$\boldsymbol{\xi} = (J^{\top} J)^{-1} J^{\top} \mathbf{r} \tag{9}$$
where $J = \partial \mathbf{x}' / \partial \boldsymbol{\xi} \,\big|_{\boldsymbol{\xi}=0}$ is the stack of Jacobians of the reprojected points.
Here $J$ depends only on the depth $d_0$ and the camera intrinsics; thus, the term $(J^{\top} J)^{-1} J^{\top}$ is independent of the residual flow. Taking this term to be constant, the motion update is simply a linear mapping of the residual flow, encoding geometry and allowing the gradient to be easily backpropagated through the single Gauss-Newton iteration. This property means we do not need any intermediate supervision on flow.
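Viewed this way, the transform layer reduces to a (weighted) linear least-squares solve. A hedged NumPy sketch, where the Jacobian stack and residual flow are assumed to be precomputed inputs:

```python
import numpy as np

def transform_layer(J, r, w=None):
    """One Gauss-Newton step: solve for the se(3) update xi minimizing
    sum_i w_i * (J_i @ xi - r_i)^2.
    J: (N, 2, 6) Jacobians of reprojected pixels w.r.t. xi;
    r: (N, 2) residual flow vectors; w: optional (N, 2) confidence weights."""
    J = J.reshape(-1, 6)                 # stack to (2N, 6)
    r = r.reshape(-1)                    # stack to (2N,)
    if w is not None:
        sw = np.sqrt(w.reshape(-1))      # weighted least squares via sqrt(w)
        J, r = J * sw[:, None], r * sw
    xi, *_ = np.linalg.lstsq(J, r, rcond=None)
    return xi
```

Because the solve is just matrix algebra on the flow, gradients pass through it to the flow network without any intermediate flow supervision, which is the point made above.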
In practice, we don’t need dense flow to compute camera motion. We train the network to output 2 additional channels, constrained to $(0, 1)$ with the sigmoid activation, which weight the residuals in the respective $x$ and $y$ directions. We update Equation 8 accordingly:
$$E(\boldsymbol{\xi}) = \sum_i \mathbf{w}_i \cdot \big( \mathbf{x}_i'\big(\exp(\hat{\boldsymbol{\xi}})\,T\big) - \mathbf{x}_i'(T) - \mathbf{r}_i \big)^2 \tag{10}$$

where the square and the product with $\mathbf{w}_i$ are taken elementwise over the $x$ and $y$ components.
Again, we provide no intermediate supervision. The network learns on its own to weight different image regions.
Deep learning has presented a new alternative for estimating camera motion. By predicting motion in image space, our network doesn’t need to learn the geometric relationships between depth, camera motion, and optical flow; instead, we apply geometric knowledge by directly mapping depth and optical flow to a camera motion estimate. Furthermore, our formulation is more general than methods which predict camera parameters directly. Our minimization step takes the camera intrinsics as input, meaning that as long as the optical flow is accurate, our method can generalize to situations with different image dimensions and camera parameters.
3.4 Supervision
Depth Loss: We supervise depth using the mean absolute error between the predicted and ground truth depth over the set of pixels with valid ground truth depths, denoted $V$:
$$\mathcal{L}_{depth} = \frac{1}{|V|} \sum_{\mathbf{x} \in V} \left| d(\mathbf{x}) - d^{*}(\mathbf{x}) \right| \tag{11}$$
We also include a small L1 smoothness term on the pixels where depth is missing:
$$\mathcal{L}_{smooth} = \sum_{\mathbf{x} \notin V} \left| \partial_x d(\mathbf{x}) \right| + \left| \partial_y d(\mathbf{x}) \right| \tag{12}$$
giving the total loss $\mathcal{L} = \mathcal{L}_{depth} + \lambda \, \mathcal{L}_{smooth}$.
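A minimal NumPy sketch of this combined loss; the smoothness weight `w_smooth` and the exact masking of the gradient terms are our own assumptions, not values from the paper:

```python
import numpy as np

def depth_loss(pred, gt, valid, w_smooth=0.1):
    """Mean absolute depth error over valid pixels, plus an L1 smoothness
    penalty restricted to pixels with missing ground truth.
    pred, gt: (H, W) depth maps; valid: (H, W) boolean mask."""
    l_depth = np.abs(pred - gt)[valid].mean()
    miss = ~valid
    dx = np.abs(np.diff(pred, axis=1))    # horizontal depth gradients
    dy = np.abs(np.diff(pred, axis=0))    # vertical depth gradients
    # count a gradient toward the penalty only if both endpoints lack ground truth
    l_smooth = (dx[miss[:, :-1] & miss[:, 1:]].sum() +
                dy[miss[:-1, :] & miss[1:, :]].sum()) / max(miss.sum(), 1)
    return l_depth + w_smooth * l_smooth
```

The smoothness term only fires where supervision is absent, so it regularizes the holes without blurring supervised regions.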
Motion Loss:
We craft a loss function which avoids the need to consider rotation and translation independently, and instead penalizes the network on reprojection error directly. During the forward pass, our network outputs a sequence of pose estimates $T_i^{(k)}$ for each of the $N$ images. We define the loss between two poses to be the mean Huber distance of the reprojected points, with $\rho$ denoting the Huber function:

$$\mathcal{L}_{pose}(T, T^{*}) = \frac{1}{|X|} \sum_{\mathbf{x}_i \in X} \rho\big( \mathbf{x}_i'(T) - \mathbf{x}_i'(T^{*}) \big) \tag{13}$$
The total reprojection error is taken to be the sum of errors between the predicted and ground truth poses over each iteration $k$ and image $i$:

$$\mathcal{L}_{reproj} = \sum_k \sum_i \mathcal{L}_{pose}\big( T_i^{(k)}, T_i^{*} \big) \tag{14}$$
Additionally, we want the predicted residual flow to match the motion update, so we add a regularization penalty on the residual flow following each Gauss-Newton update, penalizing the weighted squared error term in Equation 10. Finally, we want to avoid the degenerate case where all flow weights become 0, so we take the log-loss of the top $K$ weights. The final motion loss is the combination:

$$\mathcal{L}_{motion} = \mathcal{L}_{reproj} + \lambda_1 \mathcal{L}_{flow} + \lambda_2 \mathcal{L}_{w} \tag{15}$$
4 Experiments
We test our approach on the NYUv2 [35] and KITTI [10] datasets and compare to both classical SfM and monocular depth estimation approaches. We apply the following two-stage training approach, performing data augmentation by adjusting brightness and gamma and by randomly scaling the image channels.
Stage I: We train the Motion Module using the motion loss with RMSProp [40]. For the input depth, we use the ground truth depth with missing values filled in by nearest-neighbor interpolation.

Stage II: In Stage II, we jointly train the Motion and Stereo modules end-to-end on the combination of motion and depth loss terms with RMSProp. Again, DeepV2D requires an additional depth estimate. For each training instance, we choose between two options: (1) use the ground truth depth, or (2) use the depth predicted the last time this training instance was encountered. During training, we decay the probability of ground truth initialization.
Timing Information: On the NYU dataset our system operates at 340ms per iteration for a 5 frame video with 480x640 input resolution. For 5 192x1088 frames on KITTI, DeepV2D runs at 230ms/iteration.
NYU Training: We experiment on the NYU depth dataset [35] using the standard Eigen train/test split [4]. NYU provides a challenging benchmark for our approach: unlike other datasets such as KITTI, where camera motion is mostly planar, NYU exhibits more complex motion which spans all degrees of freedom.
We train Stage I for 50k iterations with a batch size of 4, and train Stage II for 160k iterations with a batch size of 1. We set the number of residual iterations in the motion module to 3. During training, we sample a set of target frames uniformly from the raw distribution. For each target frame, we sample 6 neighboring frames spaced approximately 0.25s apart. At each training iteration, we randomly sample 3 of the 6 frames. We use the full 480x640 images. NYU does not have ground truth camera pose data, but we are able to generate good estimates by applying RGB-D SLAM [29]. At test time, we initialize the depth estimate with a single-image depth network [23].
NYU Results: We show some example NYU results of DeepV2D in Figure 4. We are able to add a significant level of detail over the baseline monocular network and often make large corrections. Like classical SfM, reflective surfaces are difficult to recover. Overall, DeepV2D produces accurate and detailed depth reconstructions.
rel, RMSE, RMSE log, and sc-inv: lower is better; the δ accuracies: higher is better.

| NYUv2 | rel | RMSE | RMSE log | sc-inv | δ<1.25 | δ<1.25² | δ<1.25³ |
|---|---|---|---|---|---|---|---|
| Eigen et al. [4] | 0.215 | 0.907 | 0.285 | – | 0.611 | 0.887 | 0.971 |
| Eigen and Fergus [3] | 0.158 | 0.641 | 0.214 | 0.148 | 0.769 | 0.950 | 0.988 |
| Laina et al. [23] | 0.127 | 0.573 | 0.195 | – | 0.811 | 0.953 | 0.988 |
| DeMoN [41] | – | – | – | 0.180 | – | – | – |
| Ours | 0.118 | 0.537 | 0.173 | 0.119 | 0.858 | 0.976 | 0.994 |
| Method | iters | views | rel | δ<1.25 | δ<1.25² | δ<1.25³ |
|---|---|---|---|---|---|---|
| colmap [34] | – | 7 | 0.404 | 0.549 | 0.700 | 0.775 |
| DfUSMC [13] | – | 7 | 0.448 | 0.487 | 0.697 | 0.814 |
| (fcrn) | 0 | – | 0.125 | 0.843 | 0.963 | 0.991 |
| ours | 2 | 2 | 0.115 | 0.869 | 0.970 | 0.991 |
| ours | 2 | 3 | 0.097 | 0.905 | 0.981 | 0.994 |
| ours | 2 | 5 | 0.089 | 0.920 | 0.984 | 0.996 |
| ours | 2 | 7 | 0.088 | 0.922 | 0.984 | 0.996 |
In Table 1 we compare to single-image depth estimation networks, including the baseline single-image initialization FCRN [23]. For reference, we also include the networks from Eigen and Fergus [3] and DeMoN [41], which uses two frames to reconstruct depth. However, DeMoN was not trained directly on NYU due to insufficient supervision. We outperform the baseline network, improving the challenging δ<1.25 metric from 0.811 to 0.858.
For a direct comparison with classical SfM, we perform median matching as done in [49] to resolve global scale ambiguity (Table 1). We gather classical SfM results with the publicly available colmap [34] and DfUSMC [13], fixing camera intrinsics to the calibrated values. Both [34] and [13] are able to generate accurate and highly detailed reconstructions on many of the test images; however, they struggle to recover low texture scenes, producing large final errors. By using learned features and structural priors, DeepV2D can circumvent many of these failure cases.
| KITTI Raw | Stereo | δ<1.25 | δ<1.25² | δ<1.25³ | Abs Rel | Sq Rel | RMSE | RMSE log |
|---|---|---|---|---|---|---|---|---|
| Mean |  | 0.593 | 0.776 | 0.878 | 0.403 | 5.530 | 8.709 | 0.403 |
| Extra Data |  |  |  |  |  |  |  |  |
| Eigen et al. [4] Fine |  | 0.702 | 0.890 | 0.958 | 0.203 | 1.548 | 6.307 | 0.282 |
| Godard et al. [12] (+CityScapes) |  | 0.861 | 0.949 | 0.976 | 0.114 | 0.898 | 4.935 | 0.206 |
| Kuznietzov et al. [22] |  | 0.862 | 0.960 | 0.986 | 0.113 | 0.741 | 4.621 | 0.189 |
| DORN (vgg) [7] |  | 0.915 | 0.980 | 0.993 | 0.081 | 0.376 | 3.056 | 0.132 |
| DORN (resnet-101) [7] |  | 0.932 | 0.984 | 0.994 | 0.072 | 0.307 | 2.727 | 0.120 |
| From Scratch |  |  |  |  |  |  |  |  |
| Godard et al. [12] |  | 0.803 | 0.922 | 0.964 | 0.148 | 1.344 | 5.927 | 0.247 |
| Yang et al. [47] |  | 0.888 | 0.958 | 0.980 | 0.097 | 0.734 | 4.442 | 0.187 |
| DfUSMC [13] | Y | 0.617 | 0.796 | 0.874 | 0.346 | 5.984 | 8.879 | 0.454 |
| Ours (no stereo) |  | 0.831 | 0.942 | 0.977 | 0.135 | 0.949 | 4.932 | 0.210 |
| Ours | Y | 0.923 | 0.970 | 0.987 | 0.091 | 0.582 | 3.644 | 0.154 |
NYU Ablations: DeepV2D introduces an iterative method for depth estimation. In Figure 5 we look at the convergence properties of our proposed system by plotting the scale matched absolute relative error (abs rel) as a function of the number of motion/stereo module iterations. The baseline model is initialized with a single image depth estimate from [23]. We test a version where we instead initialize with a flat depth estimate of 5m (labeled w/o init). While convergence is slower, we can still recover good depth estimates with enough iterations. Additionally, we test a version of our system where we replace the transform layer with a network head which predicts the camera motion update directly, keeping all other components the same. Results show that this hurts performance, indicating that our layer is beneficial for accurate depth estimation.
We can also visualize the regions which the motion module attends to in Figure 6. The motion module predicts two residual flow weight maps, one for each of the $x$ and $y$ directions, which reflect the confidence of the flow vector in the respective direction, and it learns to upweight edges and salient image regions.
KITTI: For completeness, we evaluate different variants of our proposed approach on the KITTI driving benchmark [10] and compare to single-image and stereo based approaches. For testing we follow the Eigen train/test split proposed in [4]. The KITTI dataset contains many dynamically moving objects, presenting a challenging scenario for stereopsis.
We use an input sequence of 5 images for our network, formed by taking the two closest frames before and after the target image. We generate ground truth depth by reprojecting 3D Velodyne points onto the left color camera, resize images to 292x1088 pixels and crop the top 100 pixels for an input size of 192x1088, and obtain ground truth motion from the GPS/IMU files.
We train Stage I for 20k iterations with a batch size of 4, and train Stage II for 200k iterations with a batch size of 1. We set the number of residual iterations in the motion module to be 3. At test time we bootstrap the reconstruction process by using the output from the pose regression network, which is trained with all 5 frames as input.
We provide quantitative results in Table 3. Godard et al. [12] and Yang et al. [47] train single-image depth networks supervised on a photoconsistency measure between stereo pairs. Kuznietzov et al. [22] combine photoconsistency with Velodyne data. DORN [7] is a single-image network which is initialized with a pretrained ResNet model. We outperform our classical baseline and all methods not using external data, and we are competitive with DORN on the outlier-robust δ metrics.

We test a baseline model where we disable stereo cues by training the network on videos containing only the keyframe. This result shows that while our network has the capacity to estimate depth using no stereo information, stereopsis greatly improves results, reducing the error by more than 50%.
In Figure 7 we provide some visualizations of DeepV2D depth predictions. We are able to recover a significant level of detail, including thin structures such as poles and trees.
SUN3D, DeepTAM Comparison: We compare the performance of our method to DeepTAM [48] on the SUN3D dataset. We follow the setup in DeepTAM, where depth estimation is tested in isolation using the camera motion provided in the dataset. For DeMoN, SGM, and DTAM we report the results as provided by [48]. As Table 4 shows, we outperform classical mapping systems like Semi-Global Matching [16] and DTAM [31]. DeepTAM uses 3 networks to recover depth, including a single-image refinement network. We outperform DeepTAM despite using only a single network and applying no refinement.
Method   | L1-inv | L1-rel | sc-inv
DTAM     | 0.210  | 0.423  | 0.374
SGM      | 0.197  | 0.412  | 0.340
DeMoN    | --     | --     | 0.146
DeepTAM  | 0.064  | 0.111  | 0.130
Ours     | 0.056  | 0.105  | 0.124

Table 4: Depth estimation error on SUN3D (lower is better).
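The three error metrics in Table 4 are commonly defined as below (following the evaluation protocol popularized by DeMoN [41]): L1-inv is the mean absolute difference of inverse depths, L1-rel the mean absolute relative error, and sc-inv the scale-invariant log error of Eigen et al. [4]. This is an illustrative sketch, not the authors' exact evaluation code.

```python
import math

def l1_inv(pred, gt):
    """Mean absolute difference of inverse depths."""
    return sum(abs(1.0 / p - 1.0 / g) for p, g in zip(pred, gt)) / len(gt)

def l1_rel(pred, gt):
    """Mean absolute relative depth error."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)

def sc_inv(pred, gt):
    """Scale-invariant log error (Eigen et al.): std of log-depth residuals,
    which is zero for predictions that differ from gt by a constant scale."""
    z = [math.log(p) - math.log(g) for p, g in zip(pred, gt)]
    n = len(z)
    var = sum(v * v for v in z) / n - (sum(z) / n) ** 2
    return math.sqrt(max(0.0, var))  # clamp tiny negative rounding error
```

Note that sc-inv is invariant to a global scaling of the predicted depths, which is why it is the only metric reported for DeMoN above (its reconstructions are defined up to scale).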
5 Conclusion
We propose DeepV2D, an end-to-end differentiable system that applies both monocular and geometric cues to predict 3D structure from a video sequence. DeepV2D is built from a set of geometric modules based on classical SfM operations and applies stereopsis to recover depth.
Acknowledgements We would like to thank Zhaoheng Zheng for helping with baseline experiments. This work was partially funded by the Toyota Research Institute.
References
 [1] S. Agarwal, Y. Furukawa, N. Snavely, I. Simon, B. Curless, S. M. Seitz, and R. Szeliski. Building Rome in a day. Communications of the ACM, 54(10):105–112, 2011.

 [2] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2758–2766, 2015.
 [3] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
 [4] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multiscale deep network. In Advances in neural information processing systems, pages 2366–2374, 2014.
 [5] J. Engel, V. Koltun, and D. Cremers. Direct sparse odometry. IEEE transactions on pattern analysis and machine intelligence, 40(3):611–625, 2018.
 [6] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In European Conference on Computer Vision, pages 834–849. Springer, 2014.

 [7] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2002–2011, 2018.
 [8] Y. Furukawa, C. Hernández, et al. Multi-view stereo: A tutorial. Foundations and Trends® in Computer Graphics and Vision, 9(1–2):1–148, 2015.
 [9] Y. Furukawa and J. Ponce. Accurate, dense, and robust multiview stereopsis. IEEE transactions on pattern analysis and machine intelligence, 32(8):1362–1376, 2010.
 [10] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
 [11] R. Gherardi, M. Farenzena, and A. Fusiello. Improving the efficiency of hierarchical structure-and-motion. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1594–1600. IEEE, 2010.
 [12] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, volume 2, page 7, 2017.
 [13] H. Ha, S. Im, J. Park, H.-G. Jeon, and I. So Kweon. High-quality depth from uncalibrated small motion clip. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5413–5421, 2016.
 [14] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. MatchNet: Unifying feature and metric learning for patch-based matching. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 3279–3286. IEEE, 2015.
 [15] J. Heinly, J. L. Schonberger, E. Dunn, and J.M. Frahm. Reconstructing the world* in six days*(as captured by the yahoo 100 million image dataset). In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3287–3295, 2015.
 [16] H. Hirschmuller. Accurate and efficient stereo processing by semi-global matching and mutual information. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 807–814. IEEE, 2005.
 [17] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015.
 [18] A. Kar, J. Malik, and C. Häne. Learning a multiview stereo machine. In Advances in Neural Information Processing Systems, pages 364–375, 2017.
 [19] A. Kendall and R. Cipolla. Geometric loss functions for camera pose regression with deep learning. In Proc. CVPR, volume 3, page 8, 2017.
 [20] A. Kendall, M. Grimes, and R. Cipolla. Posenet: A convolutional network for realtime 6dof camera relocalization. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 2938–2946. IEEE, 2015.
 [21] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry. End-to-end learning of geometry and context for deep stereo regression. arXiv preprint arXiv:1703.04309, 2017.
 [22] Y. Kuznietsov, J. Stückler, and B. Leibe. Semi-supervised deep learning for monocular depth map prediction. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6647–6655, 2017.
 [23] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 239–248. IEEE, 2016.
 [24] H. C. LonguetHiggins. A computer algorithm for reconstructing a scene from two projections. Nature, 293(5828):133–135, 1981.
 [25] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
 [26] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4040–4048, 2016.
 [27] R. Mohr, L. Quan, and F. Veillon. Relative 3d reconstruction using multiple uncalibrated images. The International Journal of Robotics Research, 14(6):619–632, 1995.
 [28] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.
 [29] R. Mur-Artal and J. D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
 [30] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison. DTAM: Dense tracking and mapping in real-time. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2320–2327. IEEE, 2011.
 [31] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison. DTAM: Dense tracking and mapping in real-time. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2320–2327. IEEE, 2011.
 [32] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
 [33] J. L. Schonberger and J.-M. Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.
 [34] J. L. Schonberger and J.-M. Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.
 [35] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGB-D images. Computer Vision–ECCV 2012, pages 746–760, 2012.
 [36] K. N. Snavely. Scene reconstruction and visualization from internet photo collections. 2009.
 [37] N. Snavely. Scene reconstruction and visualization from internet photo collections: A survey. IPSJ Transactions on Computer Vision and Applications, 3:44–66, 2011.
 [38] H. Strasdat, J. Montiel, and A. J. Davison. Scale driftaware large scale monocular slam. Robotics: Science and Systems VI, 2, 2010.
 [39] C. Tang and P. Tan. BA-Net: Dense bundle adjustment network. arXiv preprint arXiv:1806.04807, 2018.

 [40] T. Tieleman and G. Hinton. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
 [41] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. DeMoN: Depth and motion network for learning monocular stereo. arXiv preprint arXiv:1612.02401, 2016.
 [42] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki. SfM-Net: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804, 2017.
 [43] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey. Learning depth from monocular videos using direct methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2022–2030, 2018.
 [44] S. Wang, R. Clark, H. Wen, and N. Trigoni. DeepVO: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 2043–2050. IEEE, 2017.
 [45] C. Wu et al. Visualsfm: A visual structure from motion system. 2011.
 [46] J. Xiao, A. Owens, and A. Torralba. SUN3D: A database of big spaces reconstructed using SfM and object labels. In Proceedings of the IEEE International Conference on Computer Vision, pages 1625–1632, 2013.
 [47] N. Yang, R. Wang, J. Stückler, and D. Cremers. Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry. arXiv preprint arXiv:1807.02570, 2018.
 [48] H. Zhou, B. Ummenhofer, and T. Brox. DeepTAM: Deep tracking and mapping. arXiv preprint arXiv:1808.01900, 2018.
 [49] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and egomotion from video. arXiv preprint arXiv:1704.07813, 2017.