CeMNet: Self-supervised learning for accurate continuous ego-motion estimation

06/27/2018 ∙ by Minhaeng Lee, et al. ∙ 0

In this paper, we propose a novel self-supervised learning model for estimating continuous ego-motion from video. Our model learns to estimate camera motion by watching RGBD or RGB video streams and determining translational and rotational velocities that correctly predict the appearance of future frames. Our approach differs from other recent work on self-supervised structure-from-motion in its use of a continuous motion formulation and representation of rigid motion fields rather than direct prediction of camera parameters. To make estimation robust in dynamic environments with multiple moving objects, we introduce a simple two-component segmentation process that isolates the rigid background environment from dynamic scene elements. We demonstrate state-of-the-art accuracy of the self-trained model on several benchmark ego-motion datasets and highlight the ability of the model to provide superior rotational accuracy and handling of non-rigid scene motions.




1 Introduction

Supervised machine learning techniques based on deep neural networks have shown remarkable recent progress for image recognition and segmentation tasks. However, progress in applying these powerful methods to geometric tasks such as structure-from-motion has been somewhat slower due to a number of factors. One challenge is that standard layers defined in convolutional neural network (CNN) architectures do not offer a natural way for researchers to incorporate hard-won insights about the algebraic structure of geometric vision problems, instead relying on general approximation properties of the network to re-discover these facts from training examples. This has resulted in some development of new building blocks (layers) specialized for geometric computations that can function inside standard gradient-based optimization frameworks (see e.g., [1, 2]) but interfacing these to image data is still a challenge.

A second difficulty is that optimizing convolutional neural networks (CNNs) requires large amounts of training data with ground-truth labels. Such ground-truth data is often not available for geometric problems (i.e., often requires expensive special-purpose hardware rather than human annotations). This challenge has driven recent effort to develop more realistic synthetic datasets such as Flying Chairs and MPI-Sintel [3] for flow and disparity estimation, Virtual KITTI [4] for object detection and tracking, semantic segmentation, flow and depth estimation, and SUNCG [5] for indoor room layout, depth and normal estimation.

In this paper, we overcome some of these difficulties by taking a “self-supervised” approach to learning to estimate camera motions directly from video. Self-supervision utilizes unlabeled image data by constructing an encoder that transforms the image into an alternate representation and a decoder that maps back to the original image. This approach has been widely used for low-level synthesis problems such as super-resolution [6], image colorization [7] and in-painting [8] where the encoder is fixed (creating a downsampled, grayscale or occluded version of the image) and the decoder is trained to reproduce the original image. For estimation tasks such as human pose [9], depth [10, 11], and intrinsic image decomposition [12], the structure of the decoder is typically specified by hand (e.g., synthesizing the next video frame in a sequence based on estimated optical flow and previous video frame) and the encoder is learned. This framework is appealing for geometric estimation problems since (a) it doesn’t require human supervision to generate target labels and hence can be trained on large, diverse data, and (b) the predictive component of the model can incorporate user insights into the problem structure.

Our basic model takes a pair of calibrated RGB or RGBD video frames as input, estimates optical flow and depth, determines camera and object velocities, and resynthesizes the corresponding motion fields. We show that training the model end-to-end with a self-supervised loss that enforces consistency of the predicted motion fields with the input frames yields a system that provides highly accurate estimates of camera ego-motion. We measure the effectiveness of our method using the TUM [13] and Virtual KITTI [4] datasets.

Relative to other recent papers [14, 10, 11] that have also investigated self-supervision for structure-from-motion, the novel contributions of our work are:

  • We represent camera motion implicitly in terms of motion fields and depth, which are a better match for CNN architectures that naturally operate in the image domain (rather than camera parameter space). We demonstrate that this choice yields better predictive performance, even when trained in the fully supervised setting.

  • Unlike previous self-supervised techniques, our model uses a continuous (linearized) approximation to camera motion [15, 16] which is suitable for video odometry and allows efficient backpropagation while providing strong constraints for learning from unsupervised data.

  • Our experimental results demonstrate state-of-the-art performance on benchmark datasets which include non-rigid scene motion due to dynamic objects. Our model improves substantially on estimates of camera rotation, suggesting this approach can serve well as a drop-in replacement for local estimation in existing RGB(D) SLAM pipelines.

Figure 1: Overview of network architectures used in our experiments. The top panel shows the conventional (baseline) approach that directly predicts 6DoF camera motion. The middle panel displays our proposed single layer model which predicts ego motion assuming a static (rigid) environment. Note that our model supports both supervised (red) and unsupervised (green) losses during training. The bottom panel shows a two layered model variant that segments a scene into static and dynamic components and only uses the static component for camera motion prediction. When input depth is not available, we utilize an additional monocular depth estimation network to predict it (not shown).

2 Related Work

Visual odometry is a classic and well studied problem in computer vision. Here we mention a few recent works that are most closely related to our approach.

Optical Flow, Depth and Odometry: A number of recent papers have shown great success in estimation of optical flow from video using learning-based techniques [17, 18]. Ren et al. introduced unsupervised learning for optical flow prediction [19] using photometric consistency. Garg et al. utilize consistency between stereo pairs to learn monocular depth estimation in a self-supervised manner [20]. Zhou et al. [11] jointly train estimators for monocular depth and relative pose using an unsupervised loss. SfM-Net [10] takes a similar approach but explicitly decomposes the input into multiple motion layers. The method of [21] uses stereo video for joint training of depth and camera motion (sometimes referred to as scene flow) but tests on monocular sequences. Our approach differs from these recent papers in using a continuous formulation appropriate for video. Such a formulation was recently used by Jaegle et al. [16] for robust monocular ego-motion estimation, but with classic (sparse) optical flow as input.

SLAM: While conventional simultaneous localization and mapping (SLAM) methods estimate geometric information by extracting feature points [22, 23] or by using all information in the given images [24], several learning-based methods have recently been introduced. Tateno et al. [25] propose a fusion SLAM technique that combines CNN-based depth map prediction with monocular SLAM. Melekhov et al. propose CNN-based relative pose estimation using end-to-end training with spatial pyramid pooling (SPP) [26]. Other recent works [27, 28] model the static background to predict accurate camera pose even in dynamic environments. Sun et al. address the dynamic scene problem by adding a motion removal step as pre-processing for RGBD SLAM [29]. Finally, Wang et al. [30] train a recurrent CNN to capture the longer-term processing of sequences typically handled by bundle adjustment and loop closure.

3 Continuous Ego-motion Network

Figure 1 provides an overview of the three types of architectures we consider in this paper. We take as input a successive pair of RGB images and the corresponding depth images. When depth is not available, we assume it is predicted by a monocular depth estimator (not shown). The first network directly predicts 6 DoF camera motion by attaching several fully connected layers to the end of several CNN layers. When camera motion is known, this baseline can be trained with a supervised loss, or it can be trained with a self-supervised image warping loss as done in several recent papers [11, 14, 10].

Instead of directly predicting camera motion, we advocate utilizing a fully-convolutional encoder/decoder architecture with skip connections (e.g., [31, 32, 17, 18]) to first predict optical flow. We then estimate continuous ego-motion using weighted least-squares and resynthesize the corresponding motion field. These intermediate representations can be learned using the unsupervised losses described below. When additional moving objects are present in the scene, we introduce an additional segmentation network which decomposes the optical flow into layers which are fit separately.

In the following sections we develop the continuous motion formulation, interpret our model as projecting the predicted optical flow on to the subspace of ego-motion flows, and discuss implementation of segmentation into layers.

3.1 Estimating Continuous Ego-motion

Consider the 2D trajectory of a point in the image as a function of its 3D position and motion relative to the camera. We write

where is the camera focal length. To compute the projected velocity in the image as a function of the 3D velocity we take partial derivatives. For example, the component of the velocity is:

Dropping for notational simplicity, we can thus write the image velocity as:


where the matrix is given by:

In the continuous formulation, the velocity of the point relative to the camera arises from a combination of translational and rotational motions,

where is unit length axis representation of rotational velocity of the camera and is the translation. Denoting the inverse depth at image location by

, we can see that the projected motion vector

is a linear function of the camera motion parameters:

where the matrix includes the cross product

To describe motion field for the whole image, we concatenate equations for all pixel locations and write where

We assume the focal length is a fixed quantity and in the following write the motion field as a function which is linear in both the inverse depths and camera motion parameters .
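For reference, the quantities described in this section can be written out in standard continuous-motion notation. This is a sketch rather than the paper's own typesetting: the symbol names (x, y for image coordinates, X, Y, Z for the 3D point P, f for the focal length, t and ω for translational and rotational camera velocity, ρ for inverse depth, u for image velocity, θ for the stacked motion parameters) and the sign convention in which a static point moves as −t − ω × P relative to the camera are assumptions chosen here.

```latex
% Perspective projection of a 3D point P = (X, Y, Z)
x = f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z}

% Image velocity obtained by differentiating with respect to time
\dot{x} = \frac{f\dot{X} - x\dot{Z}}{Z}, \qquad
\dot{y} = \frac{f\dot{Y} - y\dot{Z}}{Z}
\quad\Longrightarrow\quad
\mathbf{u} = \begin{bmatrix}\dot{x}\\ \dot{y}\end{bmatrix}
= \frac{1}{Z}\,A\,\dot{\mathbf{P}},
\qquad
A = \begin{bmatrix} f & 0 & -x \\ 0 & f & -y \end{bmatrix}

% Rigid motion of a static point relative to the moving camera (assumed sign convention)
\dot{\mathbf{P}} = -\mathbf{t} - \boldsymbol{\omega}\times\mathbf{P}

% Substituting P = Z\,(x/f,\; y/f,\; 1)^\top and \rho = 1/Z gives a form linear in (t, \omega)
\mathbf{u} = -\rho\,A\,\mathbf{t} + B\,\boldsymbol{\omega},
\qquad
B = \begin{bmatrix}
\dfrac{xy}{f} & -\left(f + \dfrac{x^{2}}{f}\right) & y \\[6pt]
f + \dfrac{y^{2}}{f} & -\dfrac{xy}{f} & -x
\end{bmatrix}

% Stacking all N pixels: one 2x6 block per pixel, 2N rows in total
\mathbf{u} = M(\boldsymbol{\rho})\,\boldsymbol{\theta},
\qquad
M_i = \begin{bmatrix} -\rho_i A_i & B_i \end{bmatrix},
\qquad
\boldsymbol{\theta} = \begin{bmatrix}\mathbf{t}\\ \boldsymbol{\omega}\end{bmatrix}
```

With the opposite sign convention both blocks simply change sign; in either case the stacked field is linear in the inverse depths for fixed motion parameters and linear in the motion parameters for fixed inverse depths.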

To infer the camera motion given inverse depths and image velocities , we use a least-squares estimate:

where is a weighting function that models the reliability of each pixel velocity in estimating the camera motion. The solution to this problem can be expressed in closed form using the pseudo-inverse of matrix . We denote the mapping from to estimated camera motion as .

In our model we utilize this least-squares estimator to estimate camera motion and the linear motion-field function to resynthesize the resulting motion field. Both functions are differentiable with respect to their inputs (in fact linear in the image velocities and the camera motion parameters respectively), making it straightforward and efficient to incorporate them into a network that is trained end-to-end using gradient-based methods.
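The estimation and resynthesis steps of this section can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the interleaved flow layout, the sign convention (a static point moves as −t − ω × P) and the synthetic data are all assumptions made for illustration. The weighted least-squares problem is solved by scaling each row by the square root of its weight and calling an ordinary least-squares solver, which is equivalent to applying the weighted pseudo-inverse.

```python
import numpy as np

def motion_matrix(xs, ys, inv_depth, f):
    """Stack per-pixel 2x6 blocks into a (2N, 6) matrix.

    Rows alternate (x-component, y-component); columns are (t, omega).
    Assumed sign convention: a static point moves as -t - omega x P.
    """
    M = np.zeros((2 * xs.size, 6))
    # Translational part: -rho * [[f, 0, -x], [0, f, -y]]
    M[0::2, 0] = -inv_depth * f
    M[0::2, 2] = inv_depth * xs
    M[1::2, 1] = -inv_depth * f
    M[1::2, 2] = inv_depth * ys
    # Rotational part, independent of depth
    M[0::2, 3] = xs * ys / f
    M[0::2, 4] = -(f + xs ** 2 / f)
    M[0::2, 5] = ys
    M[1::2, 3] = f + ys ** 2 / f
    M[1::2, 4] = -xs * ys / f
    M[1::2, 5] = -xs
    return M

def estimate_motion(flow, M, weights):
    """Weighted least squares: argmin_theta sum_i w_i ||u_i - M_i theta||^2."""
    w = np.repeat(np.sqrt(weights), 2)  # same weight for both flow components
    theta, *_ = np.linalg.lstsq(M * w[:, None], flow * w, rcond=None)
    return theta

def resynthesize(M, theta):
    """Motion field generated by camera motion theta (linear in theta)."""
    return M @ theta

# Synthetic check with hypothetical values: recover a known motion
# while ten corrupted pixels are given zero weight.
rng = np.random.default_rng(0)
n, f = 200, 500.0
xs = rng.uniform(-320.0, 320.0, n)
ys = rng.uniform(-240.0, 240.0, n)
inv_depth = 1.0 / rng.uniform(1.0, 10.0, n)
theta_true = np.array([0.1, -0.05, 0.2, 0.01, -0.02, 0.005])

M = motion_matrix(xs, ys, inv_depth, f)
flow = resynthesize(M, theta_true)
flow[:20] += 50.0                 # corrupt both components of the first 10 pixels
weights = np.ones(n)
weights[:10] = 0.0                # mark those pixels as unreliable

theta_hat = estimate_motion(flow, M, weights)
```

Because the corrupted pixels receive zero weight, the estimate is determined entirely by the consistent pixels. This is the role the per-pixel weights play when a segmentation mask excludes dynamic regions.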

Figure 2: Schematic interpretation of different loss functions: (a) predicting pose directly, (b) predicting pose via flow space, (c) comparison of losses. (a) Supervised training of direct models utilizes a loss defined on camera pose space. (b) Our approach defines losses on the space of pixel flows and considers losses that measure the distance to the true motion field, the sub-space of possible ego-motion fields (blue), and its orthogonal complement (gray dashed). The model is also guided by photometric or scene-flow consistency between input frames (yellow). (c) Prediction error for supervised models trained with different combinations of these losses, indicating that using losses defined in flow-space outperforms direct prediction of camera motion.

Figure 3: A sample result on a dynamic sequence from TUM [13]. From an input frame pair (a) and (b), the network predicts optical flow (d). Both camera and object motion are visible in the frame difference (c). A single motion field (e) is dominated by large object motions and yields poor warping error (f), particularly on the background. Our model includes a segmentation network that divides the image into dynamic and static masks (g,j) and fits corresponding motion fields (h,k). These provide better warping error on the objects (i) and background (l) respectively.

3.2 Projecting optical flow onto ego-motion

Given the true motion field, it is straightforward to estimate the true camera motion. In practice, the motion must be estimated from image data which is often ambiguous (e.g., due to lack of texture) and noisy. Typically there is a large set of image flows that are photometrically consistent from which we must select the true motion field. Our architecture utilizes a CNN to generate an initial flow estimate from image data, then uses the least-squares estimator to fit a camera motion and finally reconstructs the image motion field corresponding to that camera motion. The composition of the estimation and resynthesis steps can be seen as a linear projection of the initial flow estimate into the space of continuous motion fields.

A key tenet of our approach is that it is a better match to the capabilities of a CNN architecture to predict the ego motion field in the image domain (and subsequently map to camera motion) rather than attempting to directly predict in the camera pose space. In particular, this allows for richer loss functions that guide the training of the network. We illustrate this idea schematically for the case of supervised learning in Figure 2. Panel (a) depicts the direct approach in terms of a loss function whose gradient pulls the predicted pose towards the true pose.

We display the relationship between optical flow, motion field and camera pose in Figure 2(b). Among all possible image flows, we indicate in yellow the set which are photometrically valid (i.e., have a zero warping loss). The blue line indicates the 6-dimensional subspace consisting of those motion fields that can be generated by all possible camera velocities (conditioned on scene depth). Introducing a loss on the camera pose (either directly on the prediction, or on the resynthesized motion field) serves to pull the flow prediction towards the orthogonal complement of this space (i.e., the set denoted by the gray vertical line).

Our approach allows the consideration of two other loss functions that can provide additional guidance. When supervision is available, we can introduce a loss which directly measures the distance between the predicted flow and the true motion field. In the self-supervised setting, we can approximate this with the photometric warping loss. Additionally, in either supervised or unsupervised settings, we can include an orthogonal projection loss, which encourages the model to predict flows which are close to the space of motion fields. In Section 4, we describe how these losses are computed and adapted to the unsupervised setting.

While all of these losses are minimized in a perfect model, Figure 2(c) shows that the choice of loss during training has a substantial practical effect. In the supervised setting, optimizing the direct loss in the camera pose space (using generic fully connected layers), or in the flow space (using our least-squares fitting) results in similar prediction errors. However, adding the projection loss or directly minimizing the distance to the true motion field yields substantially better predictions (i.e., halving average camera translation error).

3.3 Static and Dynamic Motion Layers

So far, our description has assumed a camera moving through a single rigid scene. A standard approach to modeling non-rigid scenes (e.g., due to relative motion of multiple dynamic objects in addition to ego-motion) is to split the scene into a number of layers where each layer has a separate motion model [33]. For example, Zhou et al. use a binary “explainable mask” [11] to exclude outlying motions, and Vijayanarasimhan et al. segment images into K regions based on motion [10]. However, in the latter case, there is no distinction between object motion and ego motion, making it inappropriate for odometry.

We use a similar strategy in order to separate motion into two layers corresponding to static background and dynamic objects (outliers). We adopt a U-Net-like segmentation network [31] to predict this separation, which then defines the weights used for camera motion estimation using the pseudo-inverse function described in Section 3.1.

Consider a scene divided into regions corresponding to moving objects and rigid background. Let denote a mask that indicates the image support of region and denote the corresponding rigid motion field for that object considered in isolation. The composite motion field for the whole image can be written as:

In the odometry setting, we are only interested in the motion of the camera relative to static background. We thus collect any dynamic objects into a single motion field and consider a single binary mask:

In our training with this segmentation network, we use the approximated motion field for the photometric warping loss described below. For simplicity, we refer our single layer model as and dual layer model as

In Figure 3, we illustrate intermediate results demonstrating how the two layered model can better estimate camera motion in the presence of dynamic objects. Since the single layer model cannot distinguish background from foreground, the quality of its predicted camera pose is poor. Excluding the dynamic scene components from the camera motion estimation provides substantially better pose estimation, as seen in panels (i) and (l), which show less photometric warping error on the scene background relative to the single layer model shown in (f).

Hard assignment to layers: Previous work such as [10] uses a soft probabilistic prediction of layer membership (i.e., using a softmax function to generate layer weights). However, such an approach introduces degeneracy since it can utilize weighted combinations of two motions to match the flow (e.g., even in a completely rigid scene). We find that using hard assignment of motions to layers yields superior camera motion estimates. We utilize the “Gumbel sampling trick” described in [34] to implement hard assignment while still allowing differentiable end-to-end training of both the flow and segment networks.
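The forward pass of a hard, Gumbel-based layer assignment can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the paper's code: the function name, the temperature parameter `tau` and the per-pixel `logits` array are hypothetical. NumPy has no automatic differentiation, so the sketch only shows how the hard one-hot mask and its soft relaxation are produced; in an autodiff framework a common choice (assumed here, not confirmed by the text) is to use the hard mask in the forward pass and take gradients through the soft weights.

```python
import numpy as np

def gumbel_hard_assign(logits, tau=1.0, rng=None):
    """Sample a hard one-hot layer assignment per pixel.

    logits: array of shape (H, W, K) with unnormalised layer scores.
    Returns (hard, soft): a one-hot mask and its softmax relaxation.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))                  # Gumbel(0, 1) noise
    y = (logits + gumbel) / tau
    y = y - y.max(axis=-1, keepdims=True)         # numerical stability
    soft = np.exp(y) / np.exp(y).sum(axis=-1, keepdims=True)
    hard = np.eye(logits.shape[-1])[soft.argmax(axis=-1)]
    return hard, soft

# Two layers (static, dynamic) on a tiny 4x5 grid with uninformative scores.
logits = np.zeros((4, 5, 2))
hard, soft = gumbel_hard_assign(logits, tau=0.5, rng=np.random.default_rng(0))
static_mask = hard[..., 0]   # 1 where a pixel is assigned to the first layer
```

Each pixel contributes to exactly one layer, so a rigid scene cannot be explained by blending two motions, which is the degeneracy described above.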

4 Training Losses

4.1 Losses for Self-supervision

As described in Section 3.2, there are several different losses which can be applied to predicted flows. Here we adapt them to the self-supervised setting. The basic building block is to check if a predicted flow is photometrically consistent with the input image pairs.

For a given optical flow and source image, we can synthesize a warped image and check whether it matches the target image. As described in [35], this type of spatial transformation can be carried out in a differentiable framework using bilinear interpolation:

where denotes the bilinear weighting of the four sample points. For simplicity, we write to denote the warping of using flow . We then define the self-supervised flow loss using the photometric error over all pixels:

This loss serves as an approximation of when the predictions are far from the true motion field.
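The warping and photometric comparison described above can be sketched in NumPy. This is a minimal illustration with assumptions stated up front: the function names are hypothetical, the flow is taken to store (dx, dy) per pixel, samples outside the image are clamped to the border, and the error is a mean absolute difference. Which frame is treated as the source and which norm is used are not specified in the text above.

```python
import numpy as np

def warp_bilinear(image, flow):
    """Sample `image` at (x + dx, y + dy) with bilinear interpolation.

    image: (H, W) or (H, W, C) array; flow: (H, W, 2) array of (dx, dy).
    Sample positions are clamped to the image border (an assumption).
    """
    h, w = image.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = sx - x0                      # horizontal interpolation weight
    wy = sy - y0                      # vertical interpolation weight
    if image.ndim == 3:               # broadcast weights over channels
        wx, wy = wx[..., None], wy[..., None]
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom

def photometric_loss(target, source, flow):
    """Mean absolute difference between the target and the warped source."""
    return float(np.abs(target - warp_bilinear(source, flow)).mean())

# Horizontal intensity ramp shifted by exactly one pixel.
source = np.tile(np.arange(8, dtype=float), (6, 1))
flow = np.zeros((6, 8, 2))
flow[..., 0] = 1.0
target = np.minimum(source + 1.0, 7.0)
loss = photometric_loss(target, source, flow)
```

The interpolation weights are piecewise-linear functions of the flow, which is what allows such a loss to be differentiated with respect to the predicted flow in an autodiff framework.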

We can similarly apply the warping loss to the reconstructed motion field rather than the initial prediction. If the recovered motion field is correct, then the warped image should again match the target image. We can build a motion field loss using the motion-field warped image as:

where the mask is 1 when the depth at is valid, 0 otherwise. This is necessary when using a depth sensor which doesn’t provide depths at every image location. This loss acts as a proxy for minimizing the camera motion estimation error by lifting the prediction back to the flow space. When we predict camera motion for a static scene, we use the global motion field, and for a dynamic scene, we use the composite motion field.

Figure 4: Visualizations of our single layered model. The top three rows come from the TUM [13] dataset and the bottom three come from Virtual KITTI [4]. From the input images (a) and (b), the predicted flow and recovered motion field are displayed in (c) and (d) respectively. Since the motion field is derived from the camera pose estimate, the error (e) between the target image and the motion-field-based warped image reflects the accuracy of the predicted camera motion. If the predicted camera pose and depth are ideal, then the error in (e) should be zero.
Figure 5: Camera motion error, (a) translation and (b) rotation, on held-out test data as a function of training set size for the TUM (top) and Virtual KITTI (bottom) RGBD datasets. The blue line denotes training a supervised model that can’t exploit unlabeled data. The introduction of self-supervised warping losses yields much better performance when either using only unsupervised training (yellow line) or semi-supervised training (green).

Finally, we can utilize the orthogonal projection loss to minimize the distance between predicted optical flow and its projection onto the space of motion fields via:

By combining the three losses above, we can define the final self-supervised loss function

where , and weigh relative importance (we use 1, 0.1 and 0.1 respectively in our experiments).
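The orthogonal projection loss and the weighted combination can be sketched as follows. The sketch makes several assumptions: `M` is a generic (2N, 6) matrix standing in for the stacked motion-field matrix, the projection is unweighted, the residual is averaged as a squared norm, and the weights 1, 0.1 and 0.1 are applied in the order flow loss, motion field loss, projection loss. The text above gives the three values but this ordering is inferred from the order in which the losses are introduced.

```python
import numpy as np

def projection_residual(flow, M):
    """Component of `flow` orthogonal to the column space of M."""
    theta, *_ = np.linalg.lstsq(M, flow, rcond=None)
    return flow - M @ theta

def orthogonal_projection_loss(flow, M):
    """Mean squared distance between a flow and its projection onto span(M)."""
    r = projection_residual(flow, M)
    return float(r @ r) / flow.size

def total_loss(l_flow, l_motion, l_orth, weights=(1.0, 0.1, 0.1)):
    """Weighted sum of the three self-supervised terms (ordering assumed)."""
    w_flow, w_motion, w_orth = weights
    return w_flow * l_flow + w_motion * l_motion + w_orth * l_orth

# Hypothetical data: a random matrix in place of the motion-field matrix.
rng = np.random.default_rng(1)
M = rng.standard_normal((40, 6))
flow_in_span = M @ rng.standard_normal(6)     # a valid ego-motion flow
flow_generic = rng.standard_normal(40)        # an arbitrary flow

loss_in_span = orthogonal_projection_loss(flow_in_span, M)
loss_generic = orthogonal_projection_loss(flow_generic, M)
residual = projection_residual(flow_generic, M)
combined = total_loss(1.0, 2.0, 3.0)
```

A flow that already lies in the space of motion fields incurs essentially zero projection loss, while a generic flow is penalised by the size of its component outside that space.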

4.2 Semi-supervision for symmetry breaking

In our segmentation network, we have two layers corresponding to static and dynamic parts. However, in the unsupervised setting, the loss is symmetric with respect to which is selected as background. This symmetry problem can interfere with training of the model and affect final performance. To break this symmetry, we found it most effective to utilize a small amount of supervised data where camera motion is known. For the supervised data we use an additional loss term on the camera parameters.

4.3 Camera Supervision from axis-angle representation

Our network predicts camera motion in an axis-angle representation that includes translation part and rotation . For supervised loss, we treat these two components separately in order to match the criteria typically used in benchmarking pose estimation performance.

We first convert the axis-angle representation to a rotation matrix using quaternions and then combine with the translation velocity to yield a transformation matrix . Following [13], we compute the difference between our predicted transformation and the ground truth and penalize the translation and rotation components respectively by:
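The conversion and error computation described above can be sketched in NumPy. The function names are hypothetical, and the error definitions follow the common relative-pose-error convention (norm of the residual translation, rotation angle of the residual rotation); whether the training loss uses these quantities directly, squared, or otherwise scaled is not stated above and is left open here.

```python
import numpy as np

def axis_angle_to_matrix(r):
    """Rotation vector (axis * angle) -> 3x3 rotation matrix via a unit quaternion."""
    r = np.asarray(r, dtype=float)
    angle = np.linalg.norm(r)
    if angle < 1e-12:
        return np.eye(3)
    w = np.cos(angle / 2.0)
    x, y, z = (r / angle) * np.sin(angle / 2.0)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def to_transform(t, r):
    """Build a 4x4 rigid transformation from translation t and rotation vector r."""
    T = np.eye(4)
    T[:3, :3] = axis_angle_to_matrix(r)
    T[:3, 3] = t
    return T

def relative_pose_error(T_pred, T_true):
    """Translation norm and rotation angle (radians) of inv(T_true) @ T_pred."""
    E = np.linalg.inv(T_true) @ T_pred
    trans_err = float(np.linalg.norm(E[:3, 3]))
    cos_angle = np.clip((np.trace(E[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    return trans_err, float(np.arccos(cos_angle))

# Hypothetical poses: 0.2 rad rotation mismatch about z, 5 units of translation mismatch.
T_true = to_transform([0.0, 0.0, 0.0], [0.0, 0.0, 0.3])
T_pred = to_transform([3.0, 4.0, 0.0], [0.0, 0.0, 0.1])
trans_err, rot_err = relative_pose_error(T_pred, T_true)
```

Treating the two components separately keeps the translational error in the units of the scene and the rotational error in radians, matching how the two are reported separately in the tables.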

Seq.         DVO-SLAM   Kintinuous   ElasticFusion   ORB2    CeMNet(RGBD)
             [22]       [36]         [23]            [37]
fr1/desk     0.021      0.037        0.020           0.016   0.0089
fr1/desk2    0.046      0.071        0.048           0.022   0.0129
fr1/room     0.043      0.075        0.068           0.047   0.0071
fr2/xyz      0.018      0.029        0.011           0.004   0.0009
fr1/office   0.035      0.030        0.017           0.010   0.0041
fr1/nst      0.018      0.031        0.016           0.019   0.0117
fr1/360      0.092      -            -               -       0.0088
fr1/plant    0.025      -            -               -       0.0061
fr1/teddy    0.043      -            -               -       0.0139
Table 1: Relative translation error on the TUM [13] static dataset. Most of the methods in this table use RGBD frames for camera pose prediction.
Seq.        TUM [13]         SfM-Net [10]     CeMNet(RGB)
            Trans    Rot     Trans    Rot     Trans    Rot
fr1/desk    0.008    0.495   0.012    0.848   0.0113   0.6315
fr1/desk2   0.099    0.61    0.012    0.974   0.0133   0.7548
fr1/360     0.099    0.474   0.009    1.123   0.0091   0.5455
fr1/plant   0.016    1.053   0.011    0.796   0.0083   0.5487
fr1/teddy   0.020    1.14    0.0123   0.877   0.0113   0.6460
Table 2: Comparison to RGB SLAM odometry. To compare with methods that only use RGB, we train our model using monocular depth prediction [38] instead of input depth.
Method              Seq.09          Seq.10
ORB-SLAM (full)     0.014 ± 0.008   0.012 ± 0.011
ORB-SLAM (short)    0.064 ± 0.141   0.064 ± 0.130
Mean Odom           0.032 ± 0.026   0.028 ± 0.023
Zhou et al. [11]    0.021 ± 0.017   0.020 ± 0.015
Ours                0.019 ± 0.014   0.018 ± 0.013
Table 3: Absolute Trajectory Error (ATE) comparison using the KITTI dataset. For this comparison, we average errors over 5-frame snippets.
Method           Training              Testing     Trans    Rot
                 GT Depth   GT Cam     GT Depth
Geometric [16]   -          -                      0.4579   0.3423
AIGN-SfM [14]                                      0.1247   0.3333
CeMNet(RGBD)                                       0.0878   0.0781
CeMNet(RGB)                                        0.0941   0.1079
Table 4: Relative pose error comparison using Virtual KITTI [4]. Both with (CeMNet(RGBD)) and without (CeMNet(RGB)) depth inputs, our models outperform previous methods.

5 Experimental Results

For the following experiments, we use the synthetic Virtual KITTI dataset [4] depicting street scenes from a moving car, and the TUM RGBD dataset [13] which has been used to benchmark a variety of RGBD odometry algorithms. To measure performance, we use the relative pose error protocol proposed in [13].

Self-supervised learning improves model performance: To show the benefits of self-supervision, we assume that only 10% of each dataset has ground-truth available. We use 11 different sequences from the TUM dataset for training, choose a random ordering of frame pairs over the whole dataset, train models with increasingly large subsets of the data, and test on a separate held-out collection of frames. This allows us to evaluate the effect of growing the amount of supervised/unsupervised training data in a consistent way across models.

In Figure 5, we plot the relative translation/rotation errors as a function of training data size. The supervised version of the model (CeM-Sup) can only be trained on the first 10% of the dataset and makes no use of the unsupervised data. In this setting it outperforms the unsupervised model (CeM-Unsup). However, as the amount of unsupervised training data continues to grow, CeM-Unsup eventually outperforms the supervised model. For a clear comparison, the unsupervised losses are not used in training CeM-Sup. We also compare a model which uses both supervised and unsupervised losses (CeM-SemiSup), which generally yields even better performance. We note that the real-world depth data in TUM is incomplete, which limits performance of the supervised model, while the supervised model shows the expected decreasing errors on Virtual KITTI.

Motion field and warping: In Section 4.1, we describe how a predicted camera pose is used to generate a motion field that is then used in the warping loss. In Figure 4, we plot the per-pixel warping loss for several inputs. The left two columns (a-b) show the input RGB frames, (c) shows the predicted optical flow, (d) is the regenerated motion field, and (e) shows differences between the target image and the warped image. Note that blue indicates lower differences between those two images.

Camera motion error comparison: To measure the quality of the predicted camera pose, we compare our single layer model (CeMNet) with previous RGBD SLAM methods on the TUM dataset in Table 1. CeMNet(RGBD) shows the best average performance among tested methods in terms of relative translation error. Several previous methods of interest, including [11, 10], do not utilize depth as an input, instead predicting it directly from input images.

For fair comparison, we also test our model with predicted depth (CeMNet(RGB)) using the off-the-shelf monocular depth prediction method introduced by Iro et al. [38]. This model was pretrained using the NYU Depth dataset V2 [39]. We rescale the predictions by 0.9 to match the range of depths in TUM (presumably due to differences in focal length) but otherwise leave the model fixed. As shown in Table 2, our method continues to outperform others in terms of rotation and shows comparable translation errors. As another comparison, we use the KITTI [40] dataset for absolute trajectory error in Table 3. For training, we use sequences 00 to 08, and we use sequences 09 and 10 for evaluation.

Additionally, we show performance on the Virtual KITTI dataset in Table 4. We specify how each method uses the ground truth depth and camera pose data available for training and testing. Using the true depth at test time results in strong performance from our model. For fair comparison, we also evaluate our model using the monocular depth prediction model of [41] pretrained with the KITTI [40] dataset and converted from the predicted disparity to depth (we use 0.54 as the baseline distance and 725 as the focal length). The results show better performance than previous self-supervised approaches even without using ground-truth depth.
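The disparity-to-depth conversion mentioned above follows the standard stereo relation depth = baseline × focal length / disparity. A minimal sketch using the two constants quoted in the text; the function name, the unit interpretation (baseline in metres, focal length and disparity in pixels) and the small floor on the disparity are assumptions added for illustration.

```python
import numpy as np

BASELINE = 0.54       # baseline distance quoted in the text (assumed metres)
FOCAL_LENGTH = 725.0  # focal length quoted in the text (assumed pixels)

def disparity_to_depth(disparity, eps=1e-6):
    """Convert disparity (assumed in pixels) to depth; eps guards against division by zero."""
    disparity = np.asarray(disparity, dtype=float)
    return BASELINE * FOCAL_LENGTH / np.maximum(disparity, eps)

# Larger disparities correspond to closer surfaces.
depths = disparity_to_depth([39.15, 78.3, 3.915])
```

If a depth network outputs disparity normalised by image width rather than in pixels, the prediction would first need to be rescaled to pixels before applying this relation.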

Figure 6: Intermediate results of the two layered model for dynamic scene camera pose prediction. Panels: (a) input, (b) optical flow, (c) motion field and warping error using all pixels, (d) static segment, (e) motion field from the static segment, (f) warping error from the static segment. Without separating static and dynamic components, it is difficult to get good camera motions (high error in (c)). However, as shown in (f), it is possible to predict camera motion for the background by fitting only the static segment (d).
Seq.               Baseline          Single layer      Two layer         Two layer (Semi)
                   Trans    Rot      Trans    Rot      Trans    Rot      Trans    Rot
fr3/sit_static     0.0134   0.5724   0.0025   0.1667   0.0016   0.1573   0.0010   0.1527
fr3/sit_xyz        0.0179   0.7484   0.0070   0.2645   0.0068   0.2653   0.0064   0.2612
fr3/sit_halfsph    0.0104   1.0135   0.0081   0.5272   0.0080   0.5820   0.0074   0.5552
fr3/walk_static    0.0149   0.5703   0.0103   0.2107   0.0030   0.1610   0.0019   0.1583
fr3/walk_xyz       0.0174   0.7952   0.0128   0.3338   0.0079   0.2915   0.0078   0.2921
fr3/walk_halfsph   0.0166   0.9426   0.0147   0.4698   0.0107   0.4120   0.0102   0.3989
Table 5: Relative pose error comparison using the TUM dynamic dataset [13]. Generally, the two layered model shows better performance than the single layered model. Including a small amount of supervision (Semi) yields equivalent or better performance, depending on the sequence, by breaking the symmetry of the unsupervised loss.

Static/Dynamic segmentation: In Figure 6, we visualize the results of breaking the input into static and dynamic layers. For the RGB input pair (one frame is shown in (a)), the predicted optical flow is shown in (b). While the single layered model generates a motion field using the complete flow (c), the two layered model focuses on the static region (d) and generates a motion field using only that region (e). The warping error from the total flow (c) is higher than that from the static region (f), especially in the background region.

We perform a quantitative comparison on the TUM dynamic dataset, which includes both object and camera motion. The results are shown in Table 5. While single layered models such as the baseline direct prediction model and our single layer model are sensitive to dynamic objects, the two layered model shows less pose error. However, as noted previously, the unsupervised loss suffers from a symmetry as to which layer corresponds to ego-motion. We evaluate the use of a small amount of supervised data (10%) to break this symmetry in the segmentation prediction network. This yields the lowest motion errors across nearly all test sequences.

6 Conclusion

In this paper, we have introduced a novel self-supervised approach for ego-motion prediction that leverages a continuous formulation of camera motion. This allows for linear projection of flows into the space of motion fields and (differentiable) end-to-end training. Compared to direct prediction of camera motion (both our own baseline implementation and previously reported performance), this approach yields more accurate two-frame estimates of camera motions for both RGBD and RGB odometry. Our model benefits from self-supervised training, allowing it to make effective use of “free” unsupervised data. Finally, utilizing a two-layer segmentation approach makes the model more robust to the presence of dynamic objects in a scene, which otherwise interfere with accurate ego-motion estimation.

Acknowledgements: This project was supported by NSF grants IIS-1618806, IIS-1253538 and a hardware donation from NVIDIA.


  • [1] Handa, A., Bloesch, M., Pătrăucean, V., Stent, S., McCormac, J., Davison, A.: gvnn: Neural network library for geometric computer vision. In: ECCV, Springer (2016) 67–82
  • [2] Huang, Z., Wan, C., Probst, T., Gool, L.V.: Deep learning on lie groups for skeleton-based action recognition. CVPR (2017) 1243–1252
  • [3] Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: ECCV. (2012) 611–625
  • [4] Gaidon, A., Wang, Q., Cabon, Y., Vig, E.: Virtual worlds as proxy for multi-object tracking analysis. In: CVPR. (2016)
  • [5] Song, S., Yu, F., Zeng, A., Chang, A.X., Savva, M., Funkhouser, T.: Semantic scene completion from a single depth image. CVPR (2017)
  • [6] Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(2) (2016) 295–307
  • [7] Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: ECCV. (2016)
  • [8] Pathak, D., Krähenbühl, P., Donahue, J., Darrell, T., Efros, A.: Context encoders: Feature learning by inpainting. (2016)
  • [9] Tung, H., Wei, H., Yumer, E., Fragkiadaki, K.: Self-supervised learning of motion capture. In: NIPS. (2017)
  • [10] Vijayanarasimhan, S., Ricco, S., Schmid, C., Sukthankar, R., Fragkiadaki, K.: Sfm-net: Learning of structure and motion from video. CoRR (2017)
  • [11] Zhou, T., Brown, M., Snavely, N., Lowe, D.: Unsupervised learning of depth and ego-motion from video. In: CVPR. (2017)
  • [12] Janner, M., Wu, J., Kulkarni, T., Yildirim, I., Tenenbaum, J.B.: Self-Supervised Intrinsic Image Decomposition. In: NIPS. (2017)
  • [13] Sturm, J., Engelhard, N., Endres, F., Burgard, W., Cremers, D.: A benchmark for the evaluation of rgb-d slam systems. In: IROS. (Oct. 2012)
  • [14] Tung, H.F., Harley, A.W., Seto, W., Fragkiadaki, K.: Adversarial inverse graphics networks: Learning 2d-to-3d lifting and image-to-image translation from unpaired supervision. ICCV (2017)
  • [15] Oliensis, J.: The least-squares error for structure from infinitesimal motion. In Pajdla, T., Matas, J., eds.: ECCV. (2004)
  • [16] Jaegle, A., Phillips, S., Daniilidis, K.: Fast, robust, continuous monocular egomotion computation. In: 2016 IEEE International Conference on Robotics and Automation (ICRA). (May 2016) 773–780
  • [17] Dosovitskiy, A., Fischer, P., Ilg, E., Häusser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D., Brox, T.: FlowNet: Learning optical flow with convolutional networks. In: ICCV. (December 2015) 2758–2766
  • [18] Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: Flownet 2.0: Evolution of optical flow estimation with deep networks. CVPR (2016)
  • [19] Ren, Z., Yan, J., Ni, B., Liu, B., Yang, X., Zha, H.: Unsupervised deep learning for optical flow estimation. In: AAAI. (2017)
  • [20] Garg, R., B.G., V.K., Carneiro, G., Reid, I.: Unsupervised CNN for single view depth estimation: Geometry to the rescue. Springer International Publishing (2016) 740–756
  • [21] Li, R., Wang, S., Long, Z., Gu, D.: UnDeepVO: Monocular visual odometry through unsupervised deep learning. arXiv (2017)
  • [22] Kerl, C., Sturm, J., Cremers, D.: Dense visual slam for rgb-d cameras. In: Proc. of the Int. Conf. on Intelligent Robot Systems (IROS). (2013)
  • [23] Whelan, T., Leutenegger, S., Moreno, R.S., Glocker, B., Davison, A.: Elasticfusion: Dense slam without a pose graph. In: Proceedings of Robotics: Science and Systems. (2015)
  • [24] Engel, J., Cremers, D.: Lsd-slam: Large-scale direct monocular slam. In: ECCV. (2014)
  • [25] Tateno, K., Tombari, F., Laina, I., Navab, N.: Cnn-slam: Real-time dense monocular slam with learned depth prediction. CVPR (2017)
  • [26] Melekhov, I., Kannala, J., Rahtu, E.: Relative camera pose estimation using convolutional neural networks. arXiv (2017)
  • [27] Kim, D.H., Kim, J.H.: Effective background model-based rgb-d dense visual odometry in a dynamic environment. IEEE Transactions on Robotics 32(6) (Dec 2016) 1565–1573
  • [28] Li, S., Lee, D.: Rgb-d slam in dynamic environments using static point weighting. IEEE Robotics and Automation Letters 2(4) (2017) 2263–2270
  • [29] Sun, Y., Liu, M., Meng, M.Q.H.: Improving rgb-d slam in dynamic environments: A motion removal approach. Robotics and Autonomous Systems 89 (2017) 110 – 122
  • [30] Wang, S., Clark, R., Wen, H., Trigoni, N.: Deepvo: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In: ICRA. (2017) 2043–2050
  • [31] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI). Volume 9351. (2015) 234–241
  • [32] Shelhamer, E., Long, J., Darrell, T.: Fully convolutional networks for semantic segmentation. TPAMI 39(4) (April 2017)
  • [33] Weiss, Y.: Smoothness in layers: Motion segmentation using nonparametric mixture estimation. In: CVPR. (Jun 1997) 520–526
  • [34] Veit, A., Belongie, S.J.: Convolutional networks with adaptive computation graphs. CoRR (2017)
  • [35] Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R., eds.: Advances in Neural Information Processing Systems 28. Curran Associates, Inc. (2015) 2017–2025
  • [36] Whelan, T., McDonald, J., Kaess, M., Fallon, M., Johannsson, H., Leonard, J.J.: Kintinuous: Spatially extended kinectfusion. In: RSS Workshop on RGB-D: Advanced Reasoning with Depth Cameras. (July 2012)
  • [37] Mur-Artal, R., Tardós, J.D.: Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Transactions on Robotics 33(5) (2017) 1255–1262
  • [38] Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N.: Deeper depth prediction with fully convolutional residual networks. In: 3D Vision (3DV), 2016 Fourth International Conference on, IEEE (2016) 239–248
  • [39] Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: ECCV. (2012)
  • [40] Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti vision benchmark suite. In: CVPR. (2012)
  • [41] Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: CVPR. (2017)