1 Introduction
Supervised machine learning techniques based on deep neural networks have shown remarkable recent progress for image recognition and segmentation tasks. However, progress in applying these powerful methods to geometric tasks such as structure-from-motion has been somewhat slower due to a number of factors. One challenge is that standard layers defined in convolutional neural network (CNN) architectures do not offer a natural way for researchers to incorporate hard-won insights about the algebraic structure of geometric vision problems, instead relying on general approximation properties of the network to rediscover these facts from training examples. This has resulted in some development of new building blocks (layers) specialized for geometric computations that can function inside standard gradient-based optimization frameworks (see e.g., [1, 2]), but interfacing these to image data is still a challenge.
A second difficulty is that optimizing CNNs requires large amounts of training data with ground-truth labels. Such ground-truth data is often not available for geometric problems (i.e., it often requires expensive special-purpose hardware rather than human annotations). This challenge has driven recent efforts to develop more realistic synthetic datasets such as Flying Chairs and MPI-Sintel [3] for flow and disparity estimation, Virtual KITTI [4] for object detection and tracking, semantic segmentation, flow and depth estimation, and SUNCG [5] for indoor room layout, depth and normal estimation.
In this paper, we overcome some of these difficulties by taking a “self-supervised” approach to learning to estimate camera motions directly from video. Self-supervision utilizes unlabeled image data by constructing an encoder that transforms the image into an alternate representation and a decoder that maps back to the original image. This approach has been widely used for low-level synthesis problems such as super-resolution [6], colorization [7] and inpainting [8], where the encoder is fixed (creating a downsampled, grayscale or occluded version of the image) and the decoder is trained to reproduce the original image. For estimation tasks such as human pose [9], depth [10, 11], and intrinsic image decomposition [12], the structure of the decoder is typically specified by hand (e.g., synthesizing the next video frame in a sequence based on estimated optical flow and the previous video frame) and the encoder is learned. This framework is appealing for geometric estimation problems since (a) it doesn’t require human supervision to generate target labels and hence can be trained on large, diverse data, and (b) the predictive component of the model can incorporate user insights into the problem structure.
Our basic model takes a pair of calibrated RGB or RGB-D video frames as input, estimates optical flow and depth, determines camera and object velocities, and resynthesizes the corresponding motion fields. We show that training the model end-to-end with a self-supervised loss that enforces consistency of the predicted motion fields with the input frames yields a system that provides highly accurate estimates of camera egomotion. We measure the effectiveness of our method using the TUM [13] and Virtual KITTI [4] datasets.
Relative to other recent papers [14, 10, 11] that have also investigated self-supervision for structure-from-motion, the novel contributions of our work are:

We represent camera motion implicitly in terms of motion fields and depth, which are a better match for CNN architectures that naturally operate in the image domain (rather than camera parameter space). We demonstrate that this choice yields better predictive performance, even when trained in the fully supervised setting.

Unlike previous self-supervised techniques, our model uses a continuous (linearized) approximation to camera motion [15, 16], which is suitable for video odometry and allows efficient backpropagation while providing strong constraints for learning from unsupervised data.

Our experimental results demonstrate state-of-the-art performance on benchmark datasets which include non-rigid scene motion due to dynamic objects. Our model improves substantially on estimates of camera rotation, suggesting this approach can serve well as a drop-in replacement for local estimation in existing RGB(D) SLAM pipelines.
2 Related Work
Visual odometry is a classic and well studied problem in computer vision. Here we mention a few recent works that are most closely related to our approach.
Optical Flow, Depth and Odometry: A number of recent papers have shown great success in estimation of optical flow from video using learning-based techniques [17, 18]. Ren et al. introduced unsupervised learning for optical flow prediction [19] using photometric consistency. Garg et al. utilize consistency between stereo pairs to learn monocular depth estimation in a self-supervised manner [20]. Zhou et al. [11] jointly train estimators for monocular depth and relative pose using an unsupervised loss. SfM-Net [10] takes a similar approach but explicitly decomposes the input into multiple motion layers. UnDeepVO [21] uses stereo video for joint training of depth and camera motion (sometimes referred to as scene flow) but tests on monocular sequences. Our approach differs from these recent papers in using a continuous formulation appropriate for video. Such a formulation was recently used by Jaegle et al. [16] for robust monocular egomotion estimation, but using classic (sparse) optical flow as input.
SLAM: While conventional simultaneous localization and mapping (SLAM) methods estimate geometric information by extracting feature points [22, 23] or using all information in the given images [24], several learning-based methods have recently been introduced. Tateno et al. [25] propose a fusion SLAM technique that combines CNN-based depth map prediction with monocular SLAM. Melekhov et al. propose CNN-based relative pose estimation using end-to-end training with spatial pyramid pooling (SPP) [26]. Other recent works [27, 28] model the static background to predict accurate camera pose even in dynamic environments. Sun et al. address the dynamic scene problem by adding a motion removal step as preprocessing for RGB-D SLAM [29]. Finally, Wang et al. [30] train a recurrent CNN to capture the longer-term processing of sequences typically handled by bundle adjustment and loop closure.
3 Continuous Egomotion Network
Figure 1 provides an overview of three different types of architectures we consider in this paper. We take as input a successive pair of RGB images and corresponding depth images. When depth is not available, we assume it is predicted by a monocular depth estimator (not shown). The first network directly predicts 6 DoF camera motion by attaching several fully connected layers at the end of several CNN layers. This baseline can be trained with a supervised loss when camera motion is known, or with a self-supervised image warping loss as done in several recent papers [11, 14, 10].
Instead of directly predicting camera motion, we advocate utilizing a fully-convolutional encoder/decoder architecture with skip connections (e.g., [31, 32, 17, 18]) to first predict optical flow. We then estimate continuous egomotion using weighted least-squares and resynthesize the corresponding motion field. These intermediate representations can be learned using the unsupervised losses described below. When additional moving objects are present in the scene, we introduce an additional segmentation network, which decomposes the optical flow into layers that are fit separately.
In the following sections we develop the continuous motion formulation, interpret our model as projecting the predicted optical flow onto the subspace of egomotion flows, and discuss the implementation of segmentation into layers.
3.1 Estimating Continuous Egomotion
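Consider the 2D trajectory of a point in the image as a function of its 3D position and motion relative to the camera. We write
\[ \mathbf{x}(t) = \begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = \frac{f}{Z(t)} \begin{bmatrix} X(t) \\ Y(t) \end{bmatrix} \]
where $\mathbf{X}(t) = (X(t), Y(t), Z(t))$ is the 3D position of the point and $f$ is the camera focal length. To compute the projected velocity in the image $\mathbf{v} = (v_x, v_y)$ as a function of the 3D velocity $\mathbf{V} = (V_X, V_Y, V_Z)$ we take partial derivatives. For example, the $x$ component of the velocity is:
\[ v_x(t) = \frac{\partial x(t)}{\partial t} = \frac{f}{Z(t)} V_X(t) - \frac{f X(t)}{Z(t)^2} V_Z(t) = \frac{1}{Z(t)} \left( f V_X(t) - x(t) V_Z(t) \right) \]
Dropping $t$ for notational simplicity, we can thus write the image velocity as:
\[ \mathbf{v} = \frac{1}{Z} A \mathbf{V} \qquad (1) \]
where the matrix $A$ is given by:
\[ A = \begin{bmatrix} f & 0 & -x \\ 0 & f & -y \end{bmatrix} \]
In the continuous formulation, the velocity of the point relative to the camera arises from a combination of translational and rotational motions,
\[ \mathbf{V} = T + \omega \times \mathbf{X} \]
where $\omega$ is the axis-angle representation of the rotational velocity of the camera and $T$ is the translation. Denoting the inverse depth at image location $(x, y)$ by $\rho(x, y) = 1/Z$, we can see that the projected motion vector $\mathbf{v}(x, y)$ is a linear function of the camera motion parameters:
\[ \mathbf{v}(x, y) = \rho(x, y) A(x, y) T + B(x, y) \omega \]
where the matrix $B$ includes the cross product
\[ B = -\rho A [\mathbf{X}]_\times = \begin{bmatrix} -xy/f & f + x^2/f & -y \\ -(f + y^2/f) & xy/f & x \end{bmatrix} \]
To describe the motion field for the whole image, we concatenate the equations for all $N$ pixel locations and write
\[ \mathbf{v} = Q(\rho) \begin{bmatrix} T \\ \omega \end{bmatrix} \]
where
\[ Q(\rho) = \begin{bmatrix} \rho_1 A_1 & B_1 \\ \vdots & \vdots \\ \rho_N A_N & B_N \end{bmatrix} \]
We assume the focal length is a fixed quantity and in the following write the motion field as a function $\mathbf{v} = F(\rho, T, \omega)$ which is linear in both the inverse depths $\rho$ and camera motion parameters $(T, \omega)$.
To infer the camera motion given inverse depths $\rho$ and image velocities $\mathbf{v}$, we use a least-squares estimate:
\[ (\hat{T}, \hat{\omega}) = \arg\min_{T, \omega} \sum_{i=1}^{N} w_i \left\| \mathbf{v}_i - \rho_i A_i T - B_i \omega \right\|^2 \]
where $w$ is a weighting function that models the reliability of each pixel velocity in estimating the camera motion. The solution to this problem can be expressed in closed form using the pseudo-inverse of the weighted matrix $W^{1/2} Q(\rho)$, where $W$ is the diagonal matrix of pixel weights. We denote the mapping from $(\mathbf{v}, \rho, w)$ to the estimated camera motion as $P(\mathbf{v}, \rho, w)$.
In our model we utilize $P$ to estimate camera motion and $F$ to resynthesize the resulting motion field. Both functions are differentiable with respect to their inputs (in fact linear in $\mathbf{v}$ and $(T, \omega)$ respectively), making it straightforward and efficient to incorporate them into a network that is trained end-to-end using gradient-based methods.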
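The weighted least-squares fit and the resynthesis step can be illustrated with a small NumPy sketch. It is not the implementation used in our experiments: the function names (`motion_basis`, `estimate_egomotion`, `synthesize_motion_field`), the sign convention (point velocity taken as T + ω × X, with image coordinates measured from the principal point) and the synthetic test values are all assumptions made for illustration.

```python
import numpy as np

def motion_basis(x, y, rho, f):
    """Stack the per-pixel 2x6 blocks [rho*A | B] into a (2N, 6) matrix."""
    n = x.size
    Q = np.zeros((2 * n, 6))
    # Translational part rho * A, with A = [[f, 0, -x], [0, f, -y]].
    Q[0::2, 0] = rho * f
    Q[0::2, 2] = -rho * x
    Q[1::2, 1] = rho * f
    Q[1::2, 2] = -rho * y
    # Rotational part B (independent of depth), assumed sign convention.
    Q[0::2, 3] = -x * y / f
    Q[0::2, 4] = f + x * x / f
    Q[0::2, 5] = -y
    Q[1::2, 3] = -(f + y * y / f)
    Q[1::2, 4] = x * y / f
    Q[1::2, 5] = x
    return Q

def estimate_egomotion(flow, x, y, rho, f, w):
    """Weighted least-squares fit of (T, omega) to a flow field of shape (N, 2)."""
    Q = motion_basis(x, y, rho, f)
    sw = np.sqrt(np.repeat(w, 2))  # one weight per pixel, applied to both rows
    sol, *_ = np.linalg.lstsq(Q * sw[:, None], flow.reshape(-1) * sw, rcond=None)
    return sol  # [Tx, Ty, Tz, wx, wy, wz]

def synthesize_motion_field(motion, x, y, rho, f):
    """Resynthesize the (N, 2) motion field implied by a camera motion."""
    return (motion_basis(x, y, rho, f) @ motion).reshape(-1, 2)

# Synthetic check: a noise-free motion field should give back its own motion.
rng = np.random.default_rng(0)
n, f = 500, 525.0
x = rng.uniform(-320, 320, n)
y = rng.uniform(-240, 240, n)
rho = 1.0 / rng.uniform(0.5, 5.0, n)
true_motion = np.array([0.02, -0.01, 0.05, 0.001, -0.002, 0.003])
flow = synthesize_motion_field(true_motion, x, y, rho, f)
est = estimate_egomotion(flow, x, y, rho, f, np.ones(n))
print(np.allclose(est, true_motion, atol=1e-6))
```

Because both steps are plain linear algebra, the same computation carries over directly to an automatic differentiation framework, where the pixel weights would come from the network rather than being fixed to one.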
Figure 2: Schematic interpretation of different loss functions. Panels: (a) predict pose directly; (b) predict pose via flow space; (c) loss comparison. (a) Supervised training of direct models utilizes a loss defined on camera pose space. (b) Our approach defines losses on the space of pixel flows and considers losses that measure the distance to the true motion field, the subspace of possible egomotion fields (blue), and its orthogonal complement (gray dashed). The model is also guided by photometric or scene-flow consistency between input frames (yellow). (c) Prediction error for supervised models trained with different combinations of these losses, indicating that using losses defined in flow space outperforms direct prediction of camera motion.
Figure 3: Intermediate results for the single-layer and two-layer models, panels (a) to (l); panel (d) shows optical flow.
3.2 Projecting optical flow onto egomotion
Given the true motion field, it is straightforward to estimate the true camera motion. In practice, the motion must be estimated from image data which is often ambiguous (e.g., due to lack of texture) and noisy. Typically there is a large set of image flows that are photometrically consistent, from which we must select the true motion field. Our architecture utilizes a CNN to generate an initial flow estimate from image data, then uses the least-squares fit of Section 3.1 to estimate a camera motion, and finally reconstructs the image motion field corresponding to that camera motion. The composition of the least-squares fit and the motion field reconstruction can be seen as a linear projection of the initial flow estimate onto the space of continuous motion fields.
A key tenet of our approach is that it is a better match to the capabilities of a CNN architecture to predict the egomotion field in the image domain (and subsequently map to camera motion) rather than attempting to directly predict in the camera pose space. In particular, this allows for richer loss functions that guide the training of the network. We illustrate this idea schematically for the case of supervised learning in Figure 2. Panel (a) depicts the direct approach in terms of a loss function whose gradient pulls the predicted pose towards the true pose.
We display the relationship between optical flow, motion field and camera pose in Figure 2(b). Among all possible image flows, we indicate in yellow the set which are photometrically valid (i.e., have zero warping loss). The blue line indicates the 6-dimensional subspace consisting of those motion fields that can be generated by all possible camera velocities (conditioned on scene depth). Introducing a loss on the camera pose (either directly on the predicted camera motion, or on the resynthesized motion field) serves to pull the flow prediction towards the orthogonal complement of this space (i.e., the set denoted by the gray vertical line).
Our approach allows the consideration of two other loss functions that can provide additional guidance. When supervision is available, we can introduce a loss which directly measures the distance between the predicted flow and the true motion field. In the self-supervised setting, we can approximate this with the photometric warping loss. Additionally, in either supervised or unsupervised settings, we can include an orthogonal projection loss, which encourages the model to predict flows which are close to the space of motion fields. In Section 4, we describe how these losses are computed and adapted to the unsupervised setting.
While all of these losses are minimized by a perfect model, Figure 2(c) shows that the choice of loss during training has a substantial practical effect. In the supervised setting, optimizing the direct loss in the camera pose space (using generic fully connected layers), or in the flow space (using our least-squares fitting) results in similar prediction errors. However, adding the projection loss or directly minimizing the distance to the true motion field yields substantially better predictions (i.e., halving the average camera translation error).
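The projection view can be checked numerically. The sketch below assumes uniform pixel weights (so that the projection is orthogonal) and uses a random matrix `Q` as a hypothetical stand-in for the matrix whose six columns span the egomotion flows at a fixed depth map; all names are illustrative and not taken from our implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rows, n_params = 200, 6

# Hypothetical stand-in for the (2N x 6) matrix whose columns span the
# egomotion flows for a fixed depth map.
Q = rng.normal(size=(n_rows, n_params))
flow = rng.normal(size=n_rows)  # an arbitrary predicted flow, flattened

def project(v):
    """Least-squares fit followed by resynthesis: v -> Q @ pinv(Q) @ v."""
    motion, *_ = np.linalg.lstsq(Q, v, rcond=None)
    return Q @ motion

projected = project(flow)
residual = flow - projected

# Projecting twice changes nothing (idempotence) ...
assert np.allclose(project(projected), projected)
# ... and the residual is orthogonal to every egomotion flow direction.
assert np.allclose(Q.T @ residual, 0.0, atol=1e-9)

# Orthogonal projection loss: squared distance from the flow to the subspace.
projection_loss = float(residual @ residual)
# The flow energy splits into a subspace part and a residual part.
assert np.isclose(flow @ flow, projected @ projected + projection_loss)
print(projection_loss)
```

With non-uniform weights the same composition is still a projection onto the subspace, but an oblique one with respect to the unweighted inner product, so the orthogonality check above would need to use the weighted inner product.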
3.3 Static and Dynamic Motion Layers
So far, our description has assumed a camera moving through a single rigid scene. A standard approach to modeling non-rigid scenes (e.g., due to relative motion of multiple dynamic objects in addition to egomotion) is to split the scene into a number of layers where each layer has a separate motion model [33]. For example, Zhou et al. use a binary “explainable mask” [11] to exclude outlying motions, and Vijayanarasimhan et al. segment images into K regions based on motion [10]. However, in the latter case, there is no distinction between object motion and ego motion, making it inappropriate for odometry.
We use a similar strategy in order to separate motion into two layers corresponding to static background and dynamic objects (outliers). We adopt a U-Net-like segmentation network [31] to predict this separation, which then defines the weights used for camera motion estimation using the pseudo-inverse function described in Section 3.1.
Consider a scene divided into $K$ regions corresponding to moving objects and rigid background. Let $M_k$ denote a mask that indicates the image support of region $k$ and $\mathbf{v}_k$ denote the corresponding rigid motion field for that object considered in isolation. The composite motion field for the whole image can be written as:
\[ \mathbf{v}(x, y) = \sum_{k=1}^{K} M_k(x, y)\, \mathbf{v}_k(x, y) \]
In the odometry setting, we are only interested in the motion of the camera relative to static background. We thus collect any dynamic objects into a single motion field and consider a single binary mask $M$:
\[ \mathbf{v}(x, y) = M(x, y)\, \mathbf{v}_{\text{static}}(x, y) + (1 - M(x, y))\, \mathbf{v}_{\text{dynamic}}(x, y) \]
In our training with this segmentation network, we use the approximated motion field for the photometric warping loss described below. For simplicity, we refer our single layer model as and dual layer model as
In Figure 3, we illustrate intermediate results demonstrating how the two-layer model can better estimate camera motion in the presence of dynamic objects. Since the single-layer model cannot distinguish background from foreground, the quality of its predicted camera pose is poor. Excluding the dynamic scene components from the camera motion estimation provides substantially better pose estimation, as seen in panels (i) and (l), which show less photometric warping error on the scene background relative to the single-layer model shown in (f).
Hard assignment to layers: Previous work such as [10] uses a soft probabilistic prediction of layer membership (i.e., using a softmax function to generate layer weights). However, such an approach introduces degeneracy since it can utilize weighted combinations of two motions to match the flow (e.g., even in a completely rigid scene). We find that using hard assignment of motions to layers yields superior camera motion estimates. We utilize the “Gumbel sampling trick” described in [34] to implement hard assignment while still allowing differentiable end-to-end training of both the flow and segmentation networks.
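As a rough illustration of hard assignment with the Gumbel trick, the NumPy sketch below draws a one-hot layer assignment per pixel by adding Gumbel noise to the segmentation logits and taking the argmax. It shows only the forward sampling step. In an automatic differentiation framework one would typically pair it with a straight-through estimator that uses the soft probabilities for the backward pass; that pairing, the function name `gumbel_hard_assign` and the temperature parameter `tau` are assumptions for illustration rather than details of our implementation.

```python
import numpy as np

def gumbel_hard_assign(logits, rng, tau=1.0):
    """Sample a one-hot layer assignment per pixel from logits of shape (H, W, K)."""
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))                 # Gumbel(0, 1) noise
    z = (logits + gumbel) / tau
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    soft = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    hard = np.eye(logits.shape[-1])[soft.argmax(axis=-1)]
    return hard, soft

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5, 2))              # two layers: static and dynamic
hard, soft = gumbel_hard_assign(logits, rng)
static_mask = hard[..., 0]  # layer 0 is arbitrarily treated as static background here

# The binary mask can then weight the least-squares fit or blend two motion fields.
v_static = rng.normal(size=(4, 5, 2))
v_dynamic = rng.normal(size=(4, 5, 2))
composite = static_mask[..., None] * v_static + (1.0 - static_mask[..., None]) * v_dynamic
```

Each pixel then contributes to exactly one layer, which removes the degenerate solutions in which a soft blend of two motions explains the flow of a rigid scene.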
4 Training Losses
4.1 Losses for Self-supervision
As described in Section 3.2, there are several different losses which can be applied to predicted flows. Here we adapt them to the self-supervised setting. The basic building block is to check whether a predicted flow is photometrically consistent with the input image pair.
For a given optical flow and source image, we can synthesize a warped image and check whether it matches the target image. As described in [35], this type of spatial transformation can be carried out in a differentiable framework using bilinear interpolation, in which each warped pixel is a weighted sum of the four source pixels nearest to the sampling location given by the flow, with weights given by the bilinear weighting of the four sample points. We then define the self-supervised flow loss as the photometric error between the warped source image and the target image, accumulated over all pixels.
This loss serves as an approximation of the supervised flow loss when the predictions are far from the true motion field.
We can similarly apply the warping loss to the reconstructed motion field rather than the initial prediction. If the motion field we found is correct, then again the warped image should match the target image. We therefore define the motion field loss as the photometric error between the motion-field-warped image and the target image, accumulated only over pixels where a binary mask is 1; the mask is 1 when the depth at a pixel is valid and 0 otherwise. This is necessary when using a depth sensor which doesn’t provide depths at every image location. This loss acts as a proxy for minimizing the camera motion estimation error by lifting the prediction back to the flow space. When we predict camera motion for a static scene, we use the global motion field, and for a dynamic scene, we use the composite motion field.
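A minimal NumPy sketch of bilinear warping and a masked photometric error is given below. It assumes a single-channel image, a backward-warping convention (the warped image at a pixel samples the source at that pixel plus the flow), clamping at the image border, and a mean absolute difference as the error measure; the function names `bilinear_warp` and `photometric_loss` and these choices are illustrative assumptions, not a description of the exact loss used in our experiments.

```python
import numpy as np

def bilinear_warp(image, flow):
    """Sample image (H, W) at (x + u, y + v) for flow (H, W, 2) with bilinear weights."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    sx = np.clip(xs + flow[..., 0], 0.0, w - 1.0)   # clamp samples to the image
    sy = np.clip(ys + flow[..., 1], 0.0, h - 1.0)
    x0 = np.minimum(np.floor(sx).astype(int), w - 2)
    y0 = np.minimum(np.floor(sy).astype(int), h - 2)
    ax, ay = sx - x0, sy - y0                       # fractional offsets in [0, 1]
    top = (1 - ax) * image[y0, x0] + ax * image[y0, x0 + 1]
    bottom = (1 - ax) * image[y0 + 1, x0] + ax * image[y0 + 1, x0 + 1]
    return (1 - ay) * top + ay * bottom

def photometric_loss(source, target, flow, mask=None):
    """Mean absolute difference between warped source and target over valid pixels."""
    err = np.abs(bilinear_warp(source, flow) - target)
    if mask is None:
        mask = np.ones_like(err)
    return float((mask * err).sum() / max(mask.sum(), 1.0))

# A linear ramp is reproduced exactly by bilinear interpolation.
img = np.add.outer(2.0 * np.arange(6), np.arange(8.0))   # img[y, x] = 2y + x
flow = np.zeros((6, 8, 2))
flow[..., 0], flow[..., 1] = 0.5, 0.25
warped = bilinear_warp(img, flow)
valid = np.ones((6, 8))
valid[-1, :] = 0.0   # e.g. pixels without valid depth (here the clamped border)
valid[:, -1] = 0.0
print(photometric_loss(img, img + 1.0, flow, mask=valid))
```

Passing the depth-validity mask restricts the error to pixels where the motion field is defined, which is the role the mask plays in the motion field loss.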
Figure 4: (a), (b) input RGB frames; (c) optical flow; (d) motion field; (e) warping error.
Figure 5: (a) Translation error; (b) Rotation error.
Finally, we can utilize the orthogonal projection loss, which penalizes the distance between the predicted optical flow and its projection onto the space of motion fields.
By combining the three losses above, we define the final self-supervised loss as a weighted sum of the flow, motion field and projection losses, where the three weights set their relative importance (we use 1, 0.1 and 0.1 respectively in our experiments).
4.2 Semi-supervision for symmetry breaking
In our segmentation network, we have two layers corresponding to static and dynamic parts. However, in the unsupervised setting, the loss is symmetric with respect to which is selected as background. This symmetry problem can interfere with training of the model and affect final performance. To break this symmetry, we found it most effective to utilize a small amount of supervised data where camera motion is known. For the supervised data we use an additional loss term on the camera parameters.
4.3 Camera Supervision from axis-angle representation
Our network predicts camera motion in an axis-angle representation that includes a translation part and a rotation part. For the supervised loss, we treat these two components separately in order to match the criteria typically used in benchmarking pose estimation performance.
We first convert the axis-angle representation to a rotation matrix using quaternions and then combine it with the translation velocity to yield a transformation matrix. Following [13], we compute the difference between our predicted transformation and the ground truth and penalize the translation and rotation components separately.
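One way to compute such translation and rotation errors is sketched below in NumPy. For brevity it converts axis-angle to a rotation matrix with Rodrigues' formula rather than via quaternions (the two give the same matrix), and it measures the translation error as the norm of the translational part of the relative transform and the rotation error as its rotation angle, in the spirit of the relative pose error of [13]. The function names and the exact error definitions are assumptions for illustration.

```python
import numpy as np

def axis_angle_to_matrix(omega):
    """Rodrigues' formula: rotation matrix for an axis-angle vector omega (3,)."""
    omega = np.asarray(omega, dtype=float)
    theta = np.linalg.norm(omega)
    if theta < 1e-12:
        return np.eye(3)
    k = omega / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def to_transform(translation, omega):
    """Assemble a 4x4 rigid transform from translation (3,) and axis-angle (3,)."""
    T = np.eye(4)
    T[:3, :3] = axis_angle_to_matrix(omega)
    T[:3, 3] = np.asarray(translation, dtype=float)
    return T

def pose_errors(pred, gt):
    """Translation norm and rotation angle (radians) of the relative transform."""
    E = np.linalg.inv(gt) @ pred
    trans_err = np.linalg.norm(E[:3, 3])
    cos_angle = np.clip((np.trace(E[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    return float(trans_err), float(np.arccos(cos_angle))

gt = to_transform([0.10, 0.0, 0.0], [0.0, 0.0, 0.20])
pred = to_transform([0.10, 0.0, 0.0], [0.0, 0.0, 0.25])
trans_err, rot_err = pose_errors(pred, gt)
print(trans_err, rot_err)
```

In this example the two poses share the same translation and differ by a 0.05 rad rotation about the optical axis, so only the rotation term is non-zero.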
5 Experimental Results
For the following experiments, we use the synthetic Virtual KITTI dataset [4] depicting street scenes from a moving car, and the TUM RGB-D dataset [13], which has been used to benchmark a variety of RGB-D odometry algorithms. To measure performance, we use the relative pose error protocol proposed in [13].
Self-supervised learning improves model performance: To show the benefits of self-supervision, we assume that only 10% of each dataset has ground truth available. We use 11 different sequences from the TUM dataset for training, choose a random ordering of frame pairs over the whole dataset, train models with increasingly large subsets of the data, and test on a separate held-out collection of frames. This allows us to evaluate the effect of growing the amount of supervised/unsupervised training data in a consistent way across models.
In Figure 5, we plot the relative translation/rotation errors as a function of training data size. The supervised version of the model (CeMSup) can only be trained on the first 10% of the dataset and makes no use of the unsupervised data; for a clear comparison, the unsupervised losses are not used in its training. In this setting it initially outperforms the unsupervised model (CeMUnsup). However, as the amount of unsupervised training data continues to grow, CeMUnsup eventually outperforms the supervised model. We also compare a model which uses both supervised and unsupervised losses (CeMSemiSup), which generally yields even better performance. We note that the real-world depth data in TUM is incomplete, which limits the performance of the supervised model, while the supervised model shows the expected decreasing errors on Virtual KITTI.
Motion field and warping: In Section 4.1, we describe how a predicted camera pose is used to generate a motion field that is then used in the warping loss. In Figure 4, we plot the per-pixel warping loss for several inputs. The two leftmost panels (a, b) show the input RGB frames, (c) shows the predicted optical flow, (d) shows the regenerated motion field, and (e) shows the differences between the target image and the warped image. Note that blue indicates lower differences between the two images.
Camera motion error comparison: To measure the quality of the predicted camera pose, we compare our single-layer model (CeMNet) with previous RGB-D SLAM methods on the TUM dataset in Table 1. CeMNet(RGBD) shows the best average performance among tested methods in terms of relative translation error. Several previous methods of interest, including [11, 10], do not utilize depth as an input, instead predicting it directly from input images.
For fair comparison, we also test our model with predicted depth (CeMNet(RGB)) using the off-the-shelf monocular depth prediction method introduced by Laina et al. [38]. This model was pretrained on the NYU Depth V2 dataset [39]. We rescale the predictions by 0.9 to match the range of depths in TUM (presumably due to differences in focal length) but otherwise leave the model fixed. As shown in Table 2, our method continues to outperform others in terms of rotation and shows comparable translation errors. As another comparison, we use the KITTI [40] dataset for absolute trajectory error in Table 4. For training, we use sequences 00 to 08, and we use sequences 09 and 10 for evaluation.
Additionally, we show performance on the Virtual KITTI dataset in Table 4. We specify how each method uses the ground truth depth and camera pose data available for train and test. Using the true depth at test time results in strong performance from our model. For fair comparison, we also evaluate our model using the monocular depth prediction model of [41] pretrained on the KITTI [40] dataset, converting the predicted disparity to depth (we use 0.54 as the baseline distance and 725 as the focal length). The results show better performance than previous self-supervised approaches even without using ground-truth depth.
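The disparity-to-depth conversion follows the usual stereo relation, depth = baseline × focal length / disparity. The short sketch below assumes the baseline of 0.54 is in meters, the focal length of 725 is in pixels, and the disparity has already been expressed in pixels; the function name and the small `eps` guard against division by zero are illustrative additions.

```python
def disparity_to_depth(disparity_px, baseline_m=0.54, focal_px=725.0, eps=1e-6):
    """Stereo relation depth = baseline * focal / disparity, guarded against zeros."""
    return baseline_m * focal_px / max(disparity_px, eps)

# Larger disparities correspond to closer scene points.
depths = [disparity_to_depth(d) for d in (5.0, 20.0, 80.0)]
print(depths)
```

If a depth network outputs disparity normalized by the image width, it would first need to be multiplied by the width in pixels before applying this relation.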
Figure 6: Static/dynamic segmentation. Panels: (a) input; (b) optical flow; (c) results using all pixels (two panels); (d), (e), (f) results using the static region.
Table 5: Results on the TUM dynamic dataset.
Seq.  |  Baseline Trans  |  Baseline Rot  |  Single-layer Trans  |  Single-layer Rot  |  Two-layer Trans  |  Two-layer Rot  |  Two-layer (Semi) Trans  |  Two-layer (Semi) Rot
fr3/sit_static  |  0.0134  |  0.5724  |  0.0025  |  0.1667  |  0.0016  |  0.1573  |  0.0010  |  0.1527
fr3/sit_xyz  |  0.0179  |  0.7484  |  0.0070  |  0.2645  |  0.0068  |  0.2653  |  0.0064  |  0.2612
fr3/sit_halfsph  |  0.0104  |  1.0135  |  0.0081  |  0.5272  |  0.0080  |  0.5820  |  0.0074  |  0.5552
fr3/walk_static  |  0.0149  |  0.5703  |  0.0103  |  0.2107  |  0.0030  |  0.1610  |  0.0019  |  0.1583
fr3/walk_xyz  |  0.0174  |  0.7952  |  0.0128  |  0.3338  |  0.0079  |  0.2915  |  0.0078  |  0.2921
fr3/walk_halfsph  |  0.0166  |  0.9426  |  0.0147  |  0.4698  |  0.0107  |  0.4120  |  0.0102  |  0.3989
Static/Dynamic segmentation: In Figure 6, we visualize the results of breaking the input into static and dynamic layers. From the RGB input pair (a), the predicted optical flow is shown in (b). While the single-layer model generates a motion field using the complete flow (c), the two-layer model focuses on the static region (d) and generates a motion field using only that region (e). The warping error from the total flow (c) is higher than that in (f), especially in the background region.
We perform a quantitative comparison on the TUM dynamic dataset, which includes both object and camera motion. The results are shown in Table 5. While single-layer models such as the baseline direct prediction model and our single-layer model are sensitive to dynamic objects, the two-layer model shows less pose error. However, as noted previously, the unsupervised loss suffers from a symmetry as to which layer corresponds to egomotion. We evaluate the use of a small amount of supervised data (10%) to break this symmetry in the segmentation prediction network. This yields the lowest resulting motion errors across nearly all test sequences.
6 Conclusion
In this paper, we have introduced a novel self-supervised approach for egomotion prediction that leverages a continuous formulation of camera motion. This allows for linear projection of flows into the space of motion fields and (differentiable) end-to-end training. Compared to direct prediction of camera motion (both our own baseline implementation and previously reported performance), this approach yields more accurate two-frame estimates of camera motion for both RGB-D and RGB odometry. Our model benefits from self-supervised training, allowing it to make effective use of “free” unsupervised data. Finally, utilizing a two-layer segmentation approach makes the model more robust to the presence of dynamic objects in a scene, which otherwise interfere with accurate egomotion estimation.
Acknowledgements: This project was supported by NSF grants IIS-1618806, IIS-1253538 and a hardware donation from NVIDIA.
References
 [1] Handa, A., Bloesch, M., Pătrăucean, V., Stent, S., McCormac, J., Davison, A.: gvnn: Neural network library for geometric computer vision. In: ECCV, Springer (2016) 67–82
 [2] Huang, Z., Wan, C., Probst, T., Gool, L.V.: Deep learning on Lie groups for skeleton-based action recognition. In: CVPR. (2017) 1243–1252
 [3] Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: ECCV. (2012) 611–625
 [4] Gaidon, A., Wang, Q., Cabon, Y., Vig, E.: Virtual worlds as proxy for multi-object tracking analysis. In: CVPR. (2016)
 [5] Song, S., Yu, F., Zeng, A., Chang, A.X., Savva, M., Funkhouser, T.: Semantic scene completion from a single depth image. In: CVPR. (2017)
 [6] Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(2) (2016) 295–307
 [7] Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: ECCV. (2016)
 [8] Pathak, D., Krähenbühl, P., Donahue, J., Darrell, T., Efros, A.: Context encoders: Feature learning by inpainting. (2016)
 [9] Tung, H., Wei, H., Yumer, E., Fragkiadaki, K.: Self-supervised learning of motion capture. In: NIPS. (2017)
 [10] Vijayanarasimhan, S., Ricco, S., Schmid, C., Sukthankar, R., Fragkiadaki, K.: SfM-Net: Learning of structure and motion from video. CoRR (2017)
 [11] Zhou, T., Brown, M., Snavely, N., Lowe, D.: Unsupervised learning of depth and ego-motion from video. In: CVPR. (2017)
 [12] Janner, M., Wu, J., Kulkarni, T., Yildirim, I., Tenenbaum, J.B.: Self-supervised intrinsic image decomposition. In: NIPS. (2017)
 [13] Sturm, J., Engelhard, N., Endres, F., Burgard, W., Cremers, D.: A benchmark for the evaluation of RGB-D SLAM systems. In: IROS. (Oct. 2012)
 [14] Tung, H.F., Harley, A.W., Seto, W., Fragkiadaki, K.: Adversarial inverse graphics networks: Learning 2D-to-3D lifting and image-to-image translation from unpaired supervision. In: ICCV. (2017)
 [15] Oliensis, J.: The least-squares error for structure from infinitesimal motion. In Pajdla, T., Matas, J., eds.: ECCV. (2004)
 [16] Jaegle, A., Phillips, S., Daniilidis, K.: Fast, robust, continuous monocular egomotion computation. In: 2016 IEEE International Conference on Robotics and Automation (ICRA). (May 2016) 773–780
 [17] Dosovitskiy, A., Fischer, P., Ilg, E., Häusser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D., Brox, T.: FlowNet: Learning optical flow with convolutional networks. In: ICCV. (December 2015) 2758–2766
 [18] Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0: Evolution of optical flow estimation with deep networks. In: CVPR. (2016)
 [19] Ren, Z., Yan, J., Ni, B., Liu, B., Yang, X., Zha, H.: Unsupervised deep learning for optical flow estimation. In: AAAI. (2017)
 [20] Garg, R., B.G., V.K., Carneiro, G., Reid, I.: Unsupervised CNN for single view depth estimation: Geometry to the rescue. In: ECCV, Springer (2016) 740–756
 [21] Li, R., Wang, S., Long, Z., Gu, D.: UnDeepVO: Monocular visual odometry through unsupervised deep learning. arXiv (2017)
 [22] Kerl, C., Sturm, J., Cremers, D.: Dense visual SLAM for RGB-D cameras. In: Proc. of the Int. Conf. on Intelligent Robot Systems (IROS). (2013)
 [23] Whelan, T., Leutenegger, S., Moreno, R.S., Glocker, B., Davison, A.: ElasticFusion: Dense SLAM without a pose graph. In: Proceedings of Robotics: Science and Systems. (2015)
 [24] Engel, J., Cremers, D.: LSD-SLAM: Large-scale direct monocular SLAM. In: ECCV. (2014)
 [25] Tateno, K., Tombari, F., Laina, I., Navab, N.: CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction. In: CVPR. (2017)
 [26] Melekhov, I., Kannala, J., Rahtu, E.: Relative camera pose estimation using convolutional neural networks. arXiv (2017)
 [27] Kim, D.H., Kim, J.H.: Effective background model-based RGB-D dense visual odometry in a dynamic environment. IEEE Transactions on Robotics 32(6) (Dec 2016) 1565–1573
 [28] Li, S., Lee, D.: RGB-D SLAM in dynamic environments using static point weighting. IEEE Robotics and Automation Letters 2(4) (2017) 2263–2270
 [29] Sun, Y., Liu, M., Meng, M.Q.H.: Improving RGB-D SLAM in dynamic environments: A motion removal approach. Robotics and Autonomous Systems 89 (2017) 110–122
 [30] Wang, S., Clark, R., Wen, H., Trigoni, N.: DeepVO: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In: ICRA. (2017) 2043–2050
 [31] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI). Volume 9351. (2015) 234–241
 [32] Shelhamer, E., Long, J., Darrell, T.: Fully convolutional networks for semantic segmentation. TPAMI 39(4) (April 2017)
 [33] Weiss, Y.: Smoothness in layers: Motion segmentation using nonparametric mixture estimation. In: CVPR. (Jun 1997) 520–526
 [34] Veit, A., Belongie, S.J.: Convolutional networks with adaptive computation graphs. CoRR (2017)
 [35] Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R., eds.: Advances in Neural Information Processing Systems 28. Curran Associates, Inc. (2015) 2017–2025
 [36] Whelan, T., McDonald, J., Kaess, M., Fallon, M., Johannsson, H., Leonard, J.J.: Kintinuous: Spatially extended KinectFusion. In: RSS Workshop on RGB-D: Advanced Reasoning with Depth Cameras. (July 2012)
 [37] Mur-Artal, R., Tardós, J.D.: ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics 33(5) (2017) 1255–1262
 [38] Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N.: Deeper depth prediction with fully convolutional residual networks. In: 3D Vision (3DV), 2016 Fourth International Conference on, IEEE (2016) 239–248
 [39] Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: ECCV. (2012)
 [40] Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: CVPR. (2012)
 [41] Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: CVPR. (2017)