We propose DeepV2D, an end-to-end differentiable deep learning architecture for predicting depth from a video sequence. We incorporate elements of classical Structure from Motion into an end-to-end trainable pipeline by designing a set of differentiable geometric modules. Our full system alternates between predicting depth and refining camera pose. We estimate depth by building a cost volume over learned features and apply a multi-scale 3D convolutional network for stereo matching. The predicted depth is then sent to the motion module which performs iterative pose updates by mapping optical flow to a camera motion update. We evaluate our proposed system on NYU, KITTI, and SUN3D datasets and show improved results over monocular baselines and deep and classical stereo reconstruction.
In Video to Depth we are interested in estimating the depth of a given frame in a video sequence. This task has many important applications including autonomous navigation and video understanding in general.
3D reconstruction from video has traditionally been approached from the classical Structure from Motion (SfM) pipeline. SfM uses correspondence between nearby viewpoints to derive 3D structure, simultaneously building a 3D point cloud while estimating camera parameters. This pipeline has shown impressive results on a number of tasks [1, 15] and is often paired with Multi-View Stereo to build a more complete 3D representation [8, 9]. However, this pipeline is limited. Final reconstruction is only as good as correspondence, which is often inaccurate or difficult to obtain, resulting in structurally unlikely artifacts and incomplete reconstruction.
An alternative to the traditional pipeline is deep learning. Given ground truth depth, a deep network can be trained to directly predict depth from either a single image [4, 3, 23] or multiple frames [49, 41] using generic architectures consisting of standard operations such as convolution and pooling. One advantage of deep networks is that they can use monocular cues such as texture gradients and shading, as evidenced by their strong performance on depth estimation from a single image [4, 3, 23]. On the other hand, standard deep networks have been found to have trouble utilizing inter-frame correspondence—simply stacking multiple frames fails to outperform single image depth [49, 41].
In this work we propose a new approach, DeepV2D, that takes the best of both worlds by incorporating algorithmic elements of classic SfM into a deep network. We design a set of differentiable geometric modules based on classical SfM operations. Using these modules along with standard convolutional layers, we compose a video-to-depth system that is fully differentiable and end-to-end trainable.
DeepV2D is advantageous over a generic deep network because with multiview geometry hardcoded into the system, the network only needs to search for the remaining pieces. This smaller search space reduces overfitting and improves generalization. Our approach is advantageous over classic SfM because it uses learned features for correspondence and context to guide reconstruction.
Our main contribution is an end-to-end differentiable architecture that fully incorporates multiview geometry in SfM: we “differentiablize” known algorithms to recover camera motion and 3D structure from correspondence. With such algorithms embedded, the network only needs to search for correspondence. This is different from prior deep-learning based approaches [49, 42, 41, 48], which incorporate geometry to a substantially lesser extent by requiring the network to learn some known geometric operations, e.g. motion estimation from correspondence.
In particular, we introduce a novel differentiable operator, the transform layer, which uses geometry to map flow and depth into a camera pose update. In addition, we introduce a stereo network that estimates depth from a collection of frames and is, for the first time, fully differentiable with respect to all inputs, including camera pose.
Our work is closely related to DeepTAM, which estimates depth and motion from video. DeepTAM is the first work to completely replace the components of SLAM (mapping and tracking) with fully learned networks. It produces high quality depth maps and can estimate camera trajectory in challenging environments. But there are several key differences. First, we incorporate geometry to a greater extent in our motion module. We introduce the transform layer, which maps from image motion to camera motion directly, while DeepTAM requires a network to learn this mapping. Second, our work is end-to-end differentiable and trained jointly, whereas DeepTAM trains its modules in isolation and is not fully differentiable. Furthermore, we leverage learned features for stereo reconstruction by building a cost volume over learned features, while DeepTAM uses a fixed similarity metric.
We evaluate DeepV2D on three separate datasets. On NYU Depth, we substantially improve depth predictions over our monocular baseline and classical SfM results. On KITTI, we outperform all other stereo-based approaches and monocular depth estimation networks not trained with additional data. We evaluate against DeepTAM on the SUN3D dataset and show improved results.
Structure from Motion: Beginning with early systems designed for small image collections [24, 27], Structure from Motion (SfM) has improved dramatically with regard to robustness, accuracy, and scalability. Advances have come from improved features [25, 14], optimization techniques, and more scalable data structures and representations [34, 11], culminating in a number of robust systems capable of large-scale reconstruction tasks [34, 37, 45]. However, SfM is limited by the accuracy and availability of correspondence. In low texture regions, or under occlusions or lighting changes, SfM can produce noisy or missing reconstructions or fail to converge entirely.
Simultaneous Localization and Mapping (SLAM) jointly estimates camera motion and 3D structure from a video sequence [6, 28, 29, 30]. LSD-SLAM is unique in that it relies on a featureless approach to 3D reconstruction, directly estimating depth maps and camera pose by minimizing photometric error.
We replace both the motion estimation and mapping components of SfM and SLAM with neural network modules. SfM and SLAM are limited by their use of hand-crafted features. Our trainable modules are able to learn features suited for the tasks of motion and depth estimation, while retaining the geometric principles of SfM. Unlike SfM and SLAM, we can learn priors over 3D structure from large RGB-D datasets. Because our network is fully differentiable, it can readily be used in conjunction with standard deep learning components. We avoid the need for hand-engineered iterative re-weighting schemes to deal with occlusions or moving objects and instead let the network decide which image regions are important.
Single Image Depth Estimation: There has been a lot of recent interest in estimating 3D properties such as depth and surface normals from a single image. Eigen et al. first showed that deep convolutional networks could be trained directly on raw pixels to estimate depth from a single image. This network was able to use monocular features alone to recover depth. Later, deeper network architectures further improved performance [3, 23].
In our case, we are interested in estimating depth from a video sequence. Single-image depth networks can be readily applied to this task, but in doing so, they are not able to use motion to guide reconstruction. Our approach retains the advantages of single-image depth networks, while also being able to use the motion parallax present in videos.
Geometry and Deep Learning: Geometric principles have been a guiding force for many deep learning architectures. Convolutional networks have been particularly successful at stereo matching [14, 26, 21]. Kendall et al. built a 3D cost volume over 2D feature maps by sampling from a range of disparities. Kar et al. introduced LSM and showed that similar ideas could be applied to reconstruct objects from multiple viewpoints. DeepV2D retains the geometric principles of these works, but is able to reconstruct scenes from video without known camera pose. Furthermore, unlike LSM, which is limited to objects due to its choice of a Euclidean reconstruction grid, we parameterize reconstruction by camera frustum coordinates, enabling us to reconstruct challenging indoor and outdoor scenes.
Camera motion estimation with deep neural networks has generated a lot of recent interest. Kendall et al. [20, 19] focused on the problem of camera localization, while other works [49, 42, 44, 48, 41] aim to estimate camera motion between a pair of frames. These approaches all treat camera pose as a regression problem by training a network to output the parameters of the camera motion matrix. Most related to our work is DeepTAM, which is unique in that it estimates camera motion iteratively, where each new motion estimate is used to render the target frame onto the keyframe. This greatly improves tracking performance and generalization. However, like previous work on deep motion estimation, DeepTAM requires a neural network to predict camera motion. These networks must learn to map image motion into a camera motion update. We propose the transform layer, which translates motion estimation into a correspondence problem. Unlike prior work, our motion module only needs to learn optical flow.
Furthermore, unlike DeepTAM, whose components are not fully differentiable and are trained in isolation, our system is end-to-end differentiable since we construct the cost volume over learned features using differentiable bilinear sampling. This allows DeepV2D in its entirety to be jointly trained end-to-end. By using learned features for reconstruction, our stereo module can learn a robust feature representation along with contextual information to facilitate matching and reconstruction.
Several geometric optimization problems have recently been formulated as differentiable network modules. Wang et al. proposed a differentiable network operator which estimates camera motion by minimizing photometric error. BA-Net applies a differentiable implementation of the Levenberg-Marquardt (LM) algorithm to solve for camera pose and depth jointly by minimizing reprojection error in a learned feature space. These works require optimization to be performed by photometric alignment, which is often highly non-convex. Our transform layer predicts the residual term directly, which results in a simplified optimization problem.
We additionally decompose reconstruction into stereo matching and motion estimation. While our final depth estimate is a product of stereo matching, BA-Net estimates depth as a weighted combination of basis depth estimates produced by a single-image depth network.
DeepV2D predicts depth from a video sequence. We take a collection of frames, plus a given keyframe, and predict a dense depth map. While DeepV2D predicts the depth for just a single frame from a video, it can easily be extended to output the depth for any collection of frames.
We decompose depth estimation into two separate subproblems which we solve using neural network modules. First, given our image sequence and camera motion estimates, we can reconstruct depth with stereo reconstruction. Our Stereo Module performs stereo reconstruction from a collection of images with camera motion estimates. The Stereo Module requires camera motion as an input. To this end, we estimate camera motion using our Motion Module, which takes the keyframe depth as input. In the forward pass, we alternate between the stereo and motion modules as we show in Figure 1. In this work, we assume known camera intrinsics (i.e. the camera is calibrated).
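The alternation between the two modules can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the function name `alternate_modules`, the callables `motion_module` and `stereo_module`, their argument order, and the choice to run the motion step first are all assumptions made for the example.

```python
def alternate_modules(keyframe, frames, intrinsics, init_depth,
                      motion_module, stereo_module, num_iters=3):
    """Alternate pose refinement and depth prediction for a fixed number of rounds.

    motion_module and stereo_module are hypothetical callables standing in
    for the two learned modules described in the text.
    """
    depth, poses = init_depth, None
    for _ in range(num_iters):
        # Motion step: refine the camera poses using the current keyframe depth.
        poses = motion_module(keyframe, depth, frames, intrinsics, poses)
        # Stereo step: re-estimate the keyframe depth from the refined poses.
        depth = stereo_module(keyframe, frames, poses, intrinsics)
    return depth, poses
```

Because each step consumes the other's latest output, an improvement in either depth or pose can carry over to the next round.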
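As a preliminary, we define some of the operations used within the stereo and motion modules. We represent camera motion using 3D rigid body transformations. A rigid body transformation $G \in SE(3)$ describes rotation and translation in 3D:
$$G = \begin{pmatrix} R & t \\ 0 & 1 \end{pmatrix}, \quad R \in SO(3), \; t \in \mathbb{R}^3 \tag{1}$$
Furthermore, a rigid body transform can act to transform a 3D point $\mathbf{X} = (X, Y, Z, 1)^T$ in homogeneous coordinates: $\mathbf{X}' = G\mathbf{X}$.
The camera operator $\pi$ projects a 3D point $\mathbf{X}$ to a pixel $\mathbf{x} = (u, v)$:
$$\pi(\mathbf{X}) = \left( f_x \frac{X}{Z} + c_x, \; f_y \frac{Y}{Z} + c_y \right) \tag{2}$$
where $(f_x, f_y, c_x, c_y)$ are the camera intrinsics. Likewise, given depth $Z$ we can recover a 3D point in homogeneous coordinates using backprojection $\pi^{-1}$:
$$\pi^{-1}(\mathbf{x}, Z) = \left( Z \frac{u - c_x}{f_x}, \; Z \frac{v - c_y}{f_y}, \; Z, \; 1 \right)^T \tag{3}$$
With $\pi$ and $\pi^{-1}$ we can define the projective warping function $\Psi$ which maps a point $\mathbf{x}$ with depth $Z$ to a camera transformed by $G$:
$$\Psi(\mathbf{x}, Z, G) = \pi\left(G \, \pi^{-1}(\mathbf{x}, Z)\right) \tag{4}$$
All the equations defined are fully differentiable with respect to all inputs and can readily be used in conjunction with standard network layers.
As a final note, we can apply Equation 4 to render an entire image from a synthetic viewpoint provided the camera transform matrix $G$ and depth map $Z$. Letting $\mathbf{x}$ be a pixel in the rendered frame, we can compute its value by using Equation 4 to compute its location in the reference frame $I$, and sample from the projected coordinate:
$$\tilde{I}(\mathbf{x}) = I\left(\Psi(\mathbf{x}, Z(\mathbf{x}), G)\right) \tag{5}$$
Differentiable bilinear sampling computes the value of a point by interpolating from its 4-pixel neighbors with weights determined by proximity. By choosing differentiable bilinear sampling we can backpropagate the gradient through the entire rendering process with respect to all inputs (i.e. depth, pose, reference image).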
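The projection, backprojection, projective warp, and bilinear sampling operations described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions: the function names, the `(fx, fy, cx, cy)` intrinsics tuple, the single-channel image, and the border clamping are choices made for the example and are not taken from the paper.

```python
import numpy as np

def project(X, intrinsics):
    """Pinhole projection of 3D points (N, 3) to pixel coordinates (N, 2)."""
    fx, fy, cx, cy = intrinsics
    return np.stack([fx * X[:, 0] / X[:, 2] + cx,
                     fy * X[:, 1] / X[:, 2] + cy], axis=1)

def backproject(x, Z, intrinsics):
    """Lift pixels (N, 2) with depths (N,) to homogeneous 3D points (N, 4)."""
    fx, fy, cx, cy = intrinsics
    return np.stack([(x[:, 0] - cx) / fx * Z,
                     (x[:, 1] - cy) / fy * Z,
                     Z, np.ones_like(Z)], axis=1)

def warp(x, Z, G, intrinsics):
    """Projective warp: backproject, apply the 4x4 rigid transform G, reproject."""
    X = backproject(x, Z, intrinsics) @ G.T
    return project(X[:, :3], intrinsics)

def bilinear_sample(image, coords):
    """Sample a single-channel image (H, W) at continuous (x, y) coords (N, 2).

    Coordinates are clamped to the image border (an assumption of this sketch).
    """
    H, W = image.shape
    x = np.clip(coords[:, 0], 0, W - 1)
    y = np.clip(coords[:, 1], 0, H - 1)
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    wx, wy = x - x0, y - y0
    # Weighted sum of the four neighbors; weights are linear in the coordinates.
    return ((1 - wx) * (1 - wy) * image[y0, x0]
            + wx * (1 - wy) * image[y0, x0 + 1]
            + (1 - wx) * wy * image[y0 + 1, x0]
            + wx * wy * image[y0 + 1, x0 + 1])
```

Rendering a frame then amounts to calling `warp` on every keyframe pixel and passing the resulting coordinates to `bilinear_sample`. Since the interpolation weights are linear in the sampling coordinates, an automatic differentiation framework can propagate gradients back to depth and pose.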
Given a set of frames and their respective estimated poses, our stereo module predicts a dense depth map for the keyframe image. Each pose represents the transformation from the keyframe camera to the camera at the corresponding frame. Hence, a point in the keyframe with known depth can be mapped to its location in any other frame using the projective warping function defined in Equation 4.
Two-View Reconstruction: We first consider the case of two-view stereo reconstruction between the keyframe and a single reference image before showing how we can generalize to an arbitrary number of frames. Given the keyframe image and the reference image, we start by feeding each image through a convolutional neural network to generate a dense unary feature map for each. We call this network the encoder, and the weights of the encoder are shared for each image. The purpose of the encoder is to learn a dense feature representation which will provide context and facilitate stereo matching.
The stereo module constructs a cost volume from the generated feature maps. The cost volume is a stack of feature maps, each rendered from the viewpoint of the keyframe camera. We enumerate over a set of D hypothesis depths which span the ranges observed in the dataset. For each hypothesis depth, we render the reference feature map from the keyframe camera using Equation 5, assuming a planar scene at that depth, generating a warped feature map. We concatenate the keyframe feature map and the warped feature map along the channel axis to form the corresponding entry in the cost volume. The final cost volume is formed by stacking the D rendered viewpoints into a single 4D tensor. Hence, if each feature map has dimensions H x W x C, then the fully constructed cost volume has dimensions H x W x D x 2C.
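The construction of the cost volume can be sketched as follows. This is a minimal sketch with assumed names: `render` is a hypothetical callable that warps the reference features into the keyframe view for a constant hypothesis depth (for example, built from a projective warp plus bilinear sampling), and the axis ordering of the output is an illustrative choice.

```python
import numpy as np

def build_cost_volume(key_feat, ref_feat, depths, render):
    """Stack keyframe and depth-warped reference features over hypothesis depths.

    key_feat, ref_feat: feature maps of shape (H, W, C).
    depths: iterable of D hypothesis depths.
    render(ref_feat, z): hypothetical callable returning ref_feat warped into
        the keyframe view under a planar scene at depth z, shape (H, W, C).
    Returns a 4D array of shape (H, W, D, 2C).
    """
    slices = []
    for z in depths:
        warped = render(ref_feat, z)
        # Concatenate along the channel axis: 2C channels per hypothesis depth.
        slices.append(np.concatenate([key_feat, warped], axis=-1))
    return np.stack(slices, axis=2)
```

Keeping both feature vectors side by side, rather than reducing them to a fixed similarity score, leaves the choice of matching function to the subsequent 3D network.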
The cost volume is a powerful representation for stereo reconstruction and converts depth estimation into a matching problem. We perform matching and refinement with a 3D convolutional neural network. Our network consists of an encoder composed of 3x3x3 convolutional layers which subsample the spatial resolution, and a decoder which up-samples the spatial resolution. The encoder and decoder are connected with skip connections by performing elementwise addition of the feature maps. The overall architecture of the 3D matching network is similar to the hourglass network, with 2D layers replaced with 3D convolutions.
Decoder: The output of the matching network is a volume with one value per pixel and hypothesis depth. The elements of the volume represent the likelihood of a surface. We first perform softmax over the depth dimension to convert surface likelihood to a probability distribution over depths. In other words, each element $p_i(\mathbf{x})$ represents the probability of pixel $\mathbf{x}$ having hypothesis depth $z_i$.
We convert the probability volume into a single depth estimate using the differentiable argmax function. A pointwise depth estimate is found by taking the expected depth, computed as the sum of each depth multiplied by its corresponding probability:
$$Z(\mathbf{x}) = \sum_{i=1}^{D} z_i \, p_i(\mathbf{x}) \tag{6}$$
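The softmax followed by the expected-depth computation can be written in a few lines. This is a small sketch: the function name `expected_depth` and the `(H, W, D)` layout of the network output are assumptions for illustration.

```python
import numpy as np

def expected_depth(logits, depths):
    """Differentiable argmax: softmax over the depth axis, then the expectation.

    logits: matching network output of shape (H, W, D).
    depths: hypothesis depths of shape (D,).
    Returns a depth map of shape (H, W).
    """
    # Subtract the per-pixel maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    prob = np.exp(shifted)
    prob /= prob.sum(axis=-1, keepdims=True)
    return (prob * depths).sum(axis=-1)
```

Unlike a hard argmax, the expectation varies smoothly with the logits and can return depths that lie between the sampled hypotheses.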
Multiview Reconstruction: Our reconstruction pipeline can easily be extended to more than one keyframe image pair to improve performance. For each pair formed by the keyframe and one of the other frames, we compute a cost volume. Each volume is first processed by four 3x3x3 convolutional layers with shared weights before a global pooling step, as shown in Figure 2. The global pooling step aggregates information across viewpoints by averaging the feature maps of the volumes.
Our stereo network shares many similarities with classical multiview-stereo pipelines which employ a cost volume to reconstruct depth. However, our approach has two key advantages over classical techniques. First, we begin by processing the images through 2D CNNs to generate dense feature maps. Instead of using hand-crafted features, the 2D network can learn feature representations which are more robust and easier to match. Second, our 3D matching network is able to learn a similarity metric between feature vectors while using contextual information to refine the reconstruction. Classical work relies on much simpler priors like smoothness assumptions.
The input to our motion module is the keyframe image/depth pair and the remaining video frames. The motion module estimates the motion between the keyframe and each of the other frames, as we show in Figure 3. The final output of the motion module is the set of rigid body transformations, one per frame.
The motion module considers each keyframe image pair independently, and operates in parallel for each of the pairs. For the remainder of this section, we describe the operation for a single pair, but keep in mind that it works the same way for each of the other pairs.
Initialization: To generate an initial motion estimate, we simply stack the frames and train a network to predict the transformation parameters. This is in line with previous work and is what we refer to as the pose regression network. The results of the pose regression network are coarse and far from accurate enough for stereo reconstruction. Regardless, we found the pose regression network to be a good starting point to initialize our system.
Iterative Refinement: The initial motion estimate is used as the starting point for further refinement. Like the stereo module, we begin by extracting a learned feature representation for each image in the keyframe image pair, producing two feature images. Again, the weights of the feature extractor are shared across all images.
Given the keyframe depth estimate and the current motion estimate, we can render the reference feature image from the estimated viewpoint of the keyframe camera using Equation 5 (applying the projective warping function) to produce a warped feature map.
The objective of the motion module is to find a transformation which aligns the keyframe feature image with the warped feature map. Each iteration of the motion module takes the rendered feature image pair and produces a transformation which updates the current motion estimate. As the motion estimate improves, the feature images become more and more aligned, resulting in smaller incremental updates. Instead of relying on an initial motion estimate, we can test how each motion estimate agrees with the inverse projection and propose updates to correct the error.
We use Lie algebra elements to parameterize camera motion updates. An se(3) element can be mapped to SE(3) with the exponential map. The group operator is matrix multiplication. Given such an element, we can perform a camera motion update by composing its exponential with the current motion estimate (Equation 7).
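The exponential map and the pose update can be sketched with the closed-form Rodrigues expressions. Two conventions below are assumptions of this sketch rather than details confirmed by the text: the 6-vector is ordered as (translation, rotation), and the update is composed by left multiplication.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix such that hat(w) @ p equals the cross product w x p."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(xi):
    """Map a 6-vector xi = (v, w) in se(3) to a 4x4 rigid body transform."""
    v, w = np.asarray(xi[:3], dtype=float), np.asarray(xi[3:], dtype=float)
    theta = np.linalg.norm(w)
    W = hat(w)
    if theta < 1e-8:
        # First-order approximation near the identity.
        R = np.eye(3) + W
        V = np.eye(3) + 0.5 * W
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (theta - np.sin(theta)) / theta**3
        R = np.eye(3) + A * W + B * (W @ W)
        V = np.eye(3) + B * W + C * (W @ W)
    G = np.eye(4)
    G[:3, :3] = R
    G[:3, 3] = V @ v
    return G

def update_pose(G, xi):
    """Apply a motion update by left-multiplying the current estimate G."""
    return se3_exp(xi) @ G
```

Parameterizing the update in the Lie algebra keeps every iterate a valid rigid body transform, so no re-orthogonalization of the rotation is needed after an update.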
Residual Motion Network: We can now describe in greater detail how the motion updates are computed. Starting with the keyframe depth, the pair of feature images, and the current motion estimate, our residual motion network generates a motion update in the form of a Lie algebra element. As shown in Figure 3, the residual motion network is applied iteratively.
In each iteration step, we estimate the optical flow between the rendered feature image and the keyframe feature image, which we term the residual flow. To estimate flow, we concatenate the feature maps and use an encoder-decoder network with skip connections modeled after FlowNetS. This network is not directly supervised on flow; instead, the flow is used as an intermediate to produce the motion update.
The residual flow tells us the 2D motion between the rendered feature images. Our residual motion network produces an update to correct this motion. We propose the transform layer, which is differentiable and maps the 2D motion into a 3D rigid body update. More formally, the input to this layer is the residual flow, the keyframe depth, and the camera parameters. The output is a Lie algebra element which we use to perform a motion update following Equation 7.
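For a given keyframe point, we can find its location in the reference frame using the current motion estimate. We also define the reprojected point under an additional camera transformation, namely the motion update. We seek a motion update such that the displacement between these two points matches the residual flow. We formalize this as a sum-of-squares objective over all keyframe points (Equation 8).
Since this is a sum of squares, an update can be proposed with a Gauss-Newton iteration (Equation 9),
where the Jacobian is the stack of derivatives of the reprojected points with respect to the motion update.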
Here the Jacobian depends only on the depth, the current motion estimate, and the camera intrinsics; thus, the Gauss-Newton pseudo-inverse term is independent of the residual flow. Taking this term to be constant, the motion update is simply a linear mapping of the residual flow, encoding geometry and allowing the gradient to be easily backpropagated through the single Gauss-Newton iteration. This property means we do not need any intermediate supervision on flow.
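A single Gauss-Newton step of this kind can be sketched as a linear least-squares solve. This is an illustrative sketch, not the authors' layer: the function names are hypothetical, the update is linearized about the identity, the 6-vector is ordered as (translation, rotation), and `X` is assumed to hold the 3D points already transformed by the current motion estimate.

```python
import numpy as np

def projection_jacobian(X, intrinsics):
    """Jacobian of pi(exp(xi) X) with respect to xi at xi = 0.

    X: 3D points of shape (N, 3). Returns an array of shape (N, 2, 6).
    """
    fx, fy, _, _ = intrinsics
    x, y, z = X[:, 0], X[:, 1], X[:, 2]
    zero = np.zeros_like(z)
    # Derivative of the pinhole projection with respect to the 3D point (2x3).
    dpi = np.stack([np.stack([fx / z, zero, -fx * x / z**2], axis=1),
                    np.stack([zero, fy / z, -fy * y / z**2], axis=1)], axis=1)
    # Derivative of the moved point with respect to xi: [I | -hat(X)] (3x6).
    eye = np.broadcast_to(np.eye(3), (len(X), 3, 3))
    skew = np.stack([np.stack([zero, -z, y], axis=1),
                     np.stack([z, zero, -x], axis=1),
                     np.stack([-y, x, zero], axis=1)], axis=1)
    dX = np.concatenate([eye, -skew], axis=2)
    return dpi @ dX

def transform_layer(flow, X, intrinsics):
    """Map residual flow (N, 2) to a 6-vector update via the normal equations."""
    J = projection_jacobian(X, intrinsics).reshape(-1, 6)
    r = flow.reshape(-1)
    return np.linalg.solve(J.T @ J, J.T @ r)
```

For fixed depth and pose, `J` is constant, so the returned update is a linear function of `flow`. This is why gradients of a pose loss pass through the step to the flow network without requiring a separate flow loss.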
In practice, we don’t need dense flow to compute camera motion. We train the network to output 2 additional channels constrained to lie between 0 and 1 with the sigmoid activation. These channels weight the residual in the respective x and y directions. We update Equation 8 accordingly (Equation 10).
Again, we provide no intermediate supervision. The network learns on its own to weight different image regions.
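With per-residual weights, the Gauss-Newton step becomes a weighted least-squares solve. The sketch below assumes the Jacobian rows and residual flow components have already been stacked into `J` and `r`; the function name and argument layout are illustrative.

```python
import numpy as np

def weighted_gauss_newton_step(J, r, w):
    """Weighted least-squares update from stacked Jacobian rows and residuals.

    J: stacked Jacobian of shape (M, 6).
    r: residual flow components of shape (M,).
    w: per-residual weights of shape (M,), e.g. sigmoid outputs between 0 and 1.
    """
    JTW = J.T * w  # scale each residual's contribution by its weight
    return np.linalg.solve(JTW @ J, JTW @ r)
```

Setting a weight to zero removes that residual from the solve entirely, which is how unreliable regions such as occlusions or moving objects can be ignored without a hand-tuned re-weighting scheme.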
Deep learning has presented a new alternative for estimating camera motion. By predicting motion in the image space, our network doesn’t need to learn the geometric relationships between depth, camera motion, and optical flow. We can apply geometric knowledge by directly mapping depth and optical flow to a camera motion estimate. Furthermore, our formulation is more general than methods which predict camera parameters directly. Our minimization step takes as input the camera intrinsics, meaning that as long as the optical flow is accurate, our method can generalize to situations with different image dimensions and camera parameters.
Depth Loss: We supervise depth on the mean absolute error between the predicted and ground truth depth over the set of pixels with valid ground truth depth.
We also include a small L1 smoothness term on the pixels where depth is missing.
The depth loss combines these two terms.
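A depth loss of this shape can be sketched as follows. Several details are assumptions made for the example: the value of `smooth_weight`, the use of first-order finite differences for smoothness, and the rule that a difference counts only when both neighboring pixels lack ground truth.

```python
import numpy as np

def depth_loss(pred, gt, valid, smooth_weight=0.01):
    """Mean absolute error on valid pixels plus L1 smoothness where depth is missing.

    pred, gt: depth maps of shape (H, W). valid: boolean mask of shape (H, W).
    smooth_weight is a hypothetical value chosen for illustration.
    """
    data = np.abs(pred - gt)[valid].mean() if valid.any() else 0.0
    # Absolute finite differences of the prediction in x and y.
    dx = np.abs(pred[:, 1:] - pred[:, :-1])
    dy = np.abs(pred[1:, :] - pred[:-1, :])
    # Keep only differences between two pixels that both lack ground truth.
    mx = ~valid[:, 1:] & ~valid[:, :-1]
    my = ~valid[1:, :] & ~valid[:-1, :]
    smooth = (dx[mx].sum() + dy[my].sum()) / max(mx.sum() + my.sum(), 1)
    return data + smooth_weight * smooth
```

The smoothness term only acts where no supervision is available, so it regularizes holes in the ground truth without blurring supervised regions.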
We craft a loss function which avoids the need to consider rotation and translation independently and instead penalizes the network on reprojection error directly. During the forward pass, our network outputs a sequence of pose estimates for each of the images. We define the loss between two poses to be the mean Huber distance between the points reprojected under each pose.
The reprojection error is taken to be the sum of errors between the predicted and ground truth pose over each sequence and image.
Additionally, we want the predicted residual flow to match the motion update. We add a regularization penalty on the residual flow following each Gauss-Newton update by penalizing on the weighted squared error term in Equation 10. Finally, we want to avoid the degenerate case where all flow weights become 0, so we take the log loss of the top K weights. The final motion loss is the combination of these three terms.
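The reprojection term of the motion loss can be sketched as below. The function names, the Huber threshold `delta`, and the decision to apply the Huber function to the Euclidean pixel distance are assumptions of this sketch. `X` is assumed to hold 3D keyframe points obtained by backprojecting pixels with their depth.

```python
import numpy as np

def huber(a, delta=1.0):
    """Huber function of a non-negative distance a."""
    return np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta))

def pose_reprojection_loss(G_pred, G_gt, X, intrinsics, delta=1.0):
    """Mean Huber distance between points reprojected under two poses.

    G_pred, G_gt: 4x4 rigid body transforms. X: 3D keyframe points (N, 3).
    """
    fx, fy, cx, cy = intrinsics

    def reproject(G):
        P = X @ G[:3, :3].T + G[:3, 3]
        return np.stack([fx * P[:, 0] / P[:, 2] + cx,
                         fy * P[:, 1] / P[:, 2] + cy], axis=1)

    dist = np.linalg.norm(reproject(G_pred) - reproject(G_gt), axis=1)
    return huber(dist, delta).mean()
```

Measuring pose error in pixels puts rotation and translation on a common scale, so no balancing weight between the two is required.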
We test our approach on the NYUv2 and KITTI datasets and compare to both classical SfM and monocular depth estimation approaches. We apply the following 2-stage training approach. We perform data augmentation by adjusting brightness and gamma and by randomly scaling the image channels.
Stage I: We train the Motion Module using the motion loss with RMSProp. For the input depth, we use the ground truth depth with missing values filled in with nearest neighbor interpolation.
Stage II: In stage II, we jointly train the Motion and Stereo modules end-to-end on the combination of motion and depth loss terms with RMSProp. Again, DeepV2D requires an additional depth estimate. For each training instance, we choose between two options: (1) use the ground truth depth or (2) use the depth predicted last time this training instance was encountered. During training we decay the probability of ground truth initialization.
Timing Information: On the NYU dataset, our system operates at 340ms per iteration for a 5-frame video with 480x640 input resolution. For five 192x1088 frames on KITTI, DeepV2D runs at 230ms per iteration.
NYU provides a challenging benchmark to test our approach. Unlike other datasets such as KITTI where camera motion is mostly planar, NYU exhibits more complex motion which spans all degrees of freedom.
We train Stage I for 50k iterations with a batch size of 4, and train Stage II for 160k iterations with a batch size of 1. We set the number of residual iterations in the motion module to be 3. During training, we sample a set of target frames uniformly from the raw distribution. For each target frame, we sample 6 neighboring frames spaced approximately 0.25s apart. At each training iteration, we randomly sample 3 of the 6 frames. We use the full 480x640 images. NYU does not have ground truth camera pose data, but we are able to generate good estimates by applying RGB-D SLAM. At test time, we initialize the depth estimate with a single-image depth network.
NYU Results: We show some example NYU results of DeepV2D in Figure 4. We are able to add a significant level of detail over the baseline monocular network and often make large corrections. Like classical SfM, reflective surfaces are difficult to recover. Overall, DeepV2D produces accurate and detailed depth reconstructions.
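Columns marked ↓: lower is better. Columns marked ↑: higher is better.
| Method | Abs Rel ↓ | RMSE ↓ | RMSE log ↓ | sc-inv ↓ | δ<1.25 ↑ | δ<1.25² ↑ | δ<1.25³ ↑ |
|---|---|---|---|---|---|---|---|
| Eigen et al. | 0.215 | 0.907 | 0.285 | | 0.611 | 0.887 | 0.971 |
| Eigen and Fergus | 0.158 | 0.641 | 0.214 | 0.148 | 0.769 | 0.950 | 0.988 |
| Laina et al. | 0.127 | 0.573 | 0.195 | - | 0.811 | 0.953 | 0.988 |
| DeMoN | - | - | - | 0.180 | - | - | - |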
In Table 1 we compare to single-image depth estimation networks including the baseline single-image initialization FCRN. For reference, we also include the networks from Eigen and Fergus, and DeMoN, which uses two frames to reconstruct depth. However, DeMoN was not trained directly on NYU due to insufficient supervision. We outperform the baseline network, improving the challenging δ<1.25 metric from 0.811 to 0.858.
For a direct comparison with classical SfM, we perform median matching, as done in prior work, to resolve global scale ambiguity (Table 1). We gather classical SfM results with the publicly available COLMAP and DfUSMC, fixing camera intrinsics to the calibrated values. Both systems are able to generate accurate and highly detailed reconstructions on many of the test images; however, they struggle to recover low texture scenes, producing large final errors. By using learned features and structural priors, DeepV2D can circumvent many of these failure cases.
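Median matching and the error metrics reported in the tables can be computed as in the sketch below. The function name and the convention that missing ground truth is encoded as 0 are assumptions; the metric definitions follow the standard ones used in depth estimation benchmarks.

```python
import numpy as np

def scale_matched_metrics(pred, gt):
    """Rescale the prediction by the ratio of medians, then compute depth metrics.

    pred, gt: depth maps of equal shape. Pixels with gt <= 0 are treated as
    missing (an assumption of this sketch).
    """
    valid = gt > 0
    p, g = pred[valid], gt[valid]
    # Median matching resolves the global scale ambiguity of SfM reconstructions.
    p = p * np.median(g) / np.median(p)
    ratio = np.maximum(p / g, g / p)
    return {
        "abs_rel": np.mean(np.abs(p - g) / g),
        "rmse": np.sqrt(np.mean((p - g) ** 2)),
        "delta_1.25": np.mean(ratio < 1.25),
    }
```

The threshold metric counts the fraction of pixels whose predicted depth is within a factor of 1.25 of the ground truth, so it is less sensitive to a few large outliers than RMSE.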
| KITTI Raw | Stereo | δ<1.25 | δ<1.25² | δ<1.25³ | Abs Rel | Sq Rel | RMSE | RMSE log |
|---|---|---|---|---|---|---|---|---|
| Eigen et al. Fine | | 0.702 | 0.890 | 0.958 | 0.203 | 1.548 | 6.307 | 0.282 |
| Godard et al. (+Cityscapes) | | 0.861 | 0.949 | 0.976 | 0.114 | 0.898 | 4.935 | 0.206 |
| Kuznietsov et al. | | 0.862 | 0.960 | 0.986 | 0.113 | 0.741 | 4.621 | 0.189 |
| DORN (vgg) | | 0.915 | 0.980 | 0.993 | 0.081 | 0.376 | 3.056 | 0.132 |
| DORN (resnet101) | | 0.932 | 0.984 | 0.994 | 0.072 | 0.307 | 2.727 | 0.120 |
| Godard et al. | | 0.803 | 0.922 | 0.964 | 0.148 | 1.344 | 5.927 | 0.247 |
| Yang et al. | | 0.888 | 0.958 | 0.980 | 0.097 | 0.734 | 4.442 | 0.187 |
| Ours (no stereo) | | 0.831 | 0.942 | 0.977 | 0.135 | 0.949 | 4.932 | 0.210 |
NYU Ablations: DeepV2D introduces an iterative method for depth estimation. In Figure 5 we look at the convergence properties of our proposed system by plotting the scale-matched absolute relative error (abs rel) as a function of the number of motion/stereo module iterations. The baseline model is initialized with a single-image depth estimate from FCRN. We test a version where we instead initialize with a flat depth estimate of 5m (labeled w/o init). While convergence is slower, we can still recover good depth estimates with enough iterations. Additionally, we test a version of our system where we replace the transform layer with a network head which predicts the camera motion update directly, keeping all other components the same. Results show that this hurts performance, indicating that our layer is beneficial for accurate depth estimation.
We can also visualize the regions which the motion module attends to in Figure 6. The motion module predicts two residual flow weights, one for each image direction, which reflect the confidence of the flow vector in the respective directions. It learns to up-weight edges and salient image regions.
KITTI: For completeness, we evaluate different variants of our proposed approach on the KITTI driving benchmark and compare to single-image and stereo-based approaches. For testing we follow the Eigen train/test split. The KITTI dataset contains many dynamically moving objects, presenting a challenging scenario for stereopsis.
We use an input sequence of 5 images for our network, formed by taking the two closest frames before and after the target image. We generate ground truth depth by reprojecting 3D Velodyne points onto the left color camera, and obtain ground truth motion from the GPS/IMU files. We resize images to 292x1088 pixels and crop the top 100 pixels for an input size of 192x1088.
We train Stage I for 20k iterations with a batch size of 4, and train Stage II for 200k iterations with a batch size of 1. We set the number of residual iterations in the motion module to be 3. At test time we bootstrap the reconstruction process by using the output from the pose regression network, which is trained with all 5 frames as input.
We provide quantitative results in Table 3. Godard et al. and Yang et al. train single-image depth networks supervised on a photo-consistency measure between stereo pairs. Kuznietsov et al. combine photo-consistency with Velodyne data. DORN is a single-image network which is initialized with a pretrained ResNet model. We outperform our classical baseline and all methods not using external data. We are competitive with DORN on the outlier-robust metric.
We test a baseline model where we disable stereo cues by training the network on videos containing only the keyframe. This result shows that while our network has the capacity to estimate depth using no stereo information, stereopsis greatly improves results, reducing the error by more than 50%.
In Figure 7 we provide some visualizations of DeepV2D depth predictions. We are able to recover a significant level of detail, including thin structures such as poles and trees.
SUN3D, DeepTAM Comparison: We compare the performance of our method to DeepTAM on the SUN3D dataset. We follow the setup in DeepTAM where depth estimation is tested in isolation, using the camera motion provided in the dataset. For DeMoN, SGM, and DTAM we report the results as given in prior work. As Table 4 shows, we outperform classical mapping systems like Semi-Global Matching (SGM) and DTAM. DeepTAM uses 3 networks to recover depth, including a single-image refinement network. We outperform DeepTAM despite using only a single network and applying no refinement.
We propose DeepV2D, an end-to-end differentiable system which applies both monocular and geometric cues to predict 3D structure from a video sequence. DeepV2D is built from a set of geometric modules based on classical SfM operations and applies stereopsis to recover depth.
Acknowledgements We would like to thank Zhaoheng Zheng for helping with baseline experiments. This work was partially funded by the Toyota Research Institute.