Moving SLAM: Fully Unsupervised Deep Learning in Non-Rigid Scenes

05/05/2021 ∙ by Dan Xu, et al. ∙ University of Oxford ∙ The Hong Kong University of Science and Technology

We propose a method to train deep networks to decompose videos into 3D geometry (camera and depth), moving objects, and their motions, with no supervision. We build on the idea of view synthesis, which uses classical camera geometry to re-render a source image from a different point-of-view, specified by a predicted relative pose and depth map. By minimizing the error between the synthetic image and the corresponding real image in a video, the deep network that predicts pose and depth can be trained completely unsupervised. However, the view synthesis equations rely on a strong assumption: that objects do not move. This rigid-world assumption limits the predictive power, and rules out learning about objects automatically. We propose a simple solution: minimize the error on small regions of the image instead. While the scene as a whole may be non-rigid, it is always possible to find small regions that are approximately rigid, such as inside a moving object. Our network can then predict different poses for each region, in a sliding window. This represents a significantly richer model, including 6D object motions, with little additional complexity. We establish new state-of-the-art results on unsupervised odometry and depth prediction on KITTI. We also demonstrate new capabilities on EPIC-Kitchens, a challenging dataset of indoor videos, where there is no ground truth information for depth, odometry, object segmentation or motion. Yet all are recovered automatically by our method.




1 Introduction

It is a long-standing goal of computer vision to achieve a holistic understanding of a visual scene: that is, to decompose it into meaningful elements that together explain the full visual input hartley2003multiple . This goal is also at the heart of representation learning, which is concerned with extracting representations from data that generalize well for multiple tasks zamir2018taskonomy ; xu2018PAD-Net . For example, a representation that was trained for object detection will invariably ignore details that are crucial for other tasks that are not object-centric, such as monocular depth estimation. As such, it is desirable to learn models that are not narrowly-scoped to a single task caruana1997multitask , though it is not always clear how to do so without an increased annotation burden for each additional task.

A recent line of work that promises to achieve both goals at once (a holistic scene understanding without demanding additional annotations) is unsupervised learning by view synthesis garg2016unsupervised ; zhou2017unsupervised . It cleverly combines two predictions (i.e. tasks), relative camera pose estimation and depth estimation, to re-render an image from another point-of-view. By synthesizing a source frame from a video into another target frame, we obtain the supervision necessary (by comparing the synthetic and the real image) to enable end-to-end learning.

It is natural to ask if more tasks can be integrated into this framework, and intuitively the answer is yes. Each additional task that contributes to the image synthesis, by capturing an additional element of the visual world, should increase the fidelity of the model and achieve a lower reconstruction error. In this work, we propose to add the unsupervised tasks of object segmentation and object 6D motion estimation. Since our primary interest is still the recovery of 3D geometry and motion, however, we do not seek “semantic” objects, as done by some prior work casser2019depth ; tosi2020distilled ; guizilini2020semantically . Instead, we decompose the image into regions that are likely to be characterized by a well-defined rigid motion, and learn those automatically, by optimizing for the same view synthesis objective. Our technical contribution is a locally-rigid model that supports crisp object boundaries through segmentation, and its efficient implementation by reusing a commonly-available but underused “tiling” operator. This results in a lightweight model that transforms the original rigid-world model (where changes between frames are characterized only by a relative camera pose and depths) zhou2017unsupervised to a non-rigid model (where moving objects are also taken into account).

Experiments show that our proposed model is much more expressive, making predictions that are outside the capabilities of the original model (namely, segmenting moving objects and calculating their 6D motions). We show qualitative results of our neural networks successfully learning to segment non-rigid objects (hands and household objects), and recover accurate depth maps, in EPIC-Kitchens Damen2020Collection , a large-scale indoor dataset that has not been used in this context before due to its challenging nature. We also demonstrate that the unsupervised segmentation cues and non-rigid model are beneficial for the previously-considered tasks. Our model achieves a new state-of-the-art result on KITTI Geiger2013IJRR in monocular depth estimation (88.9% accuracy at $\delta < 1.25$), as well as visual odometry (0.011/0.010 ATE), from purely unsupervised data (see sec. 5 for details).

The paper is organized as follows: sec. 2 discusses relevant literature; sec. 3 introduces the view synthesis framework; sec. 4 presents our method, and sec. 5 the results; sec. 6 concludes our paper.

2 Related work

Self-supervised learning caruana1997promoting ; self-supervised-survey2019 has gathered significant attention recently, since it promises to achieve unsupervised learning by reusing standard elements from supervised learning (e.g. architectures, loss functions), in relatively intuitive configurations. It can generally be described as the task of predicting one part of the input data given only another part. Successful examples include predicting the spatial relationship between two image regions, colorizing images zhang2016colorful , predicting audio from video nagrani2018seeing ; owens2016ambient , and predicting inertial measurement unit (IMU) data (odometry) from video agrawal2015learning . A related approach is to use synthetic transformations, such as predicting the rotation of an image gidaris2018unsupervised or whether a video is played in reverse wei2018learning . Given the large body of literature on self-supervised learning, we refer interested readers to a recent survey self-supervised-survey2019 .

Within self-supervised learning, video prediction is a relatively popular task. It consists of generating a subset of a video given the remaining video. Several works have focused on video generation with neural networks srivastava2015unsupervised , by proposing multiple-hypothesis loss functions rupprecht2017learning , causal convolutions such as the PixelCNN for video kalchbrenner2017video , and generative models such as adversarial networks or variational auto-encoders videogen2019 . While successful, these approaches often produce blurry or physically-implausible predictions, and the image generation models are not interpretable.

Approaching this problem by view synthesis zhou2017unsupervised is relatively recent, but draws from traditional works in visual geometry and Simultaneous Localization And Mapping (SLAM) scaramuzza2011visual ; davison2007monoslam ; Sheng2019Unsupervised . Instead of using a neural network for generation, it uses a differentiable warp (image deformation) to transform a source frame into a target frame, minimizing the reconstruction error. The task of the neural network is then to predict physically-interpretable quantities, such as 3D geometry, depth and poses, that are used to compute the image warp. This physical model of the world is useful in itself, and can be used for downstream tasks (such as visual odometry or 3D reconstruction). This line of work was pioneered by Garg et al. garg2016unsupervised , who considered stereo pairs or pairs of frames with known pose. This was further developed to include predicted poses by Zhou et al. zhou2017unsupervised (SfMLearner), and concurrently by Vijayanarasimhan et al. sfmnet2017 (SfM-Net), who also considered multiple layers to account for objects. The SfMLearner emphasized no supervision (as opposed to mixed modes of supervision like SfM-Net), and a simple architecture, so it forms the basis of our work. We describe it in detail in sec. 3. Further developments include improving the pose estimate by a direct visual odometry method (essentially second-order gradient descent) wang2018learning , and adding a FlowNet that can refine the image warp by fine-grained optical flow estimates yin2018geonet . Other proposed improvements are stereo inputs godard2017unsupervised ; li2018undeepvo and probabilistic outputs klodt2018supervising , bundle adjustment to warp stored key frames instead of recent frames yang2018deep , adversarial training pilzer2018unsupervised , and feature computation in 3D space guizilini2019packnet .

3 Unsupervised learning by view synthesis

We begin by describing the canonical self-supervised setup for learning monocular depth estimation, inspired by early methods such as SfMLearner zhou2017unsupervised (sec. 2). Assume that we are given a pair of images $(I_s, I_t)$, respectively the source and target, usually extracted as nearby frames from a video. If the scene is Lambertian and unchanged between frames (rigid-world assumption) and if we discount occlusions, the target image $I_t$ can be predicted from knowledge of the source image $I_s$, the depth map $D$, and the relative camera pose $P$ between the two views:

$$\hat{I}_t = \phi(I_s, D, P). \tag{1}$$

Synthesizing the target image amounts to using projective geometry hartley2003multiple to find which pixels correspond in the two views and then transport their intensities from source to target. Namely, the synthesis function $\phi$ is a warp that linearly interpolates each pixel of $I_s$ at coordinates $u'$ to $u$ according to the projective equations:

$$u' \sim K \, P \, T_{D(u)} \, K^{-1} u, \tag{2}$$

where $K$ is the camera intrinsics matrix, $D(u)$ retrieves the depth at pixels $u$, and $T_{D(u)}$ is the transformation matrix that translates any input 3D point along the $z$ axis by this depth. We can extract a supervisory signal for depth and pose by comparing the measured and predicted target image:

$$\mathcal{L} = \frac{1}{|\mathcal{S}|} \sum_{(I_s, I_t) \in \mathcal{S}} \lVert I_t - \phi(I_s, D, P) \rVert_M, \tag{3}$$

where $\mathcal{S}$ is a training dataset of image pairs, $M$ is a mask that tells which pixels can be explained by the model and $\lVert x \rVert_M = \lVert M \odot x \rVert_1$, i.e. the $L_1$ norm weighted by a mask $M$.
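The pixel correspondence behind the warp can be illustrated with a minimal sketch. It assumes a pinhole camera with a single focal length `f` and principal point `(cx, cy)`, and a pose given as a rotation matrix `R` plus translation `t`; the function name `project_pixel` and this parametrization are illustrative assumptions, not the paper's implementation. It follows the standard back-project, transform, re-project recipe.

```python
def project_pixel(u, v, depth, f, cx, cy, R, t):
    """Map target pixel (u, v) with known depth to source-image coordinates."""
    # Back-project: K^{-1} [u, v, 1]^T scaled by the depth gives a 3D point.
    X = [(u - cx) / f * depth, (v - cy) / f * depth, depth]
    # Rigid transform into the source camera frame: X' = R X + t.
    Xs = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    # Re-project with K and normalize the homogeneous coordinate.
    return f * Xs[0] / Xs[2] + cx, f * Xs[1] / Xs[2] + cy


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# With an identity pose, every pixel maps to itself.
same = project_pixel(100.0, 50.0, 2.0, 500.0, 320.0, 240.0,
                     IDENTITY, [0.0, 0.0, 0.0])

# A 0.1 translation along x at depth 2 shifts the pixel by f * 0.1 / 2 = 25.
shifted = project_pixel(100.0, 50.0, 2.0, 500.0, 320.0, 240.0,
                        IDENTITY, [0.1, 0.0, 0.0])
```

The second call shows why depth and pose are coupled in this objective: the same translation produces a smaller pixel shift for points that are farther away.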

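Of course, the pose $P$, depth $D$ and mask $M$ must be obtained somehow. This is the role of the pose, depth and mask networks:

$$P = f_P(I_s, I_t; \theta), \qquad D = f_D(I_t; \theta), \qquad M = f_M(I_s, I_t; \theta), \tag{4}$$

with parameters $\theta$. The last layer of the mask network is a sigmoid, to constrain outputs to the $[0, 1]$ range. Back-propagating the error in eq. 3 to the networks’ parameters is what allows end-to-end learning. Since predicting $M = 0$ is a trivial minimum of eq. 3, a regularization term is added, penalizing the distance between $M$ and a target of 1 (see footnote 1):

$$\mathcal{R} = \lVert M - 1 \rVert. \tag{5}$$

The overall objective is then $\mathcal{L} + \lambda \mathcal{R}$, where $\mathcal{L}$ is the reconstruction loss of eq. 3 and $\lambda$ is a regularization weight.

Footnote 1: The regularization loss used in zhou2017unsupervised was the cross-entropy, which is not a material difference.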
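To make the role of the regularizer concrete, here is a small sketch of the mask-weighted reconstruction error plus a penalty pulling the mask toward 1. Images are flattened to plain lists; the L1 form of the penalty, the weight `lam = 0.2`, and all function names are assumptions chosen for illustration.

```python
def masked_l1(target, synthesized, mask):
    """Sum of |target - synthesized|, weighted per pixel by the mask."""
    return sum(m * abs(a - b) for a, b, m in zip(target, synthesized, mask))


def mask_regularizer(mask):
    """Distance between the mask and a target of all ones (L1, an assumption)."""
    return sum(abs(1.0 - m) for m in mask)


def objective(target, synthesized, mask, lam):
    """Reconstruction error plus weighted mask penalty."""
    return masked_l1(target, synthesized, mask) + lam * mask_regularizer(mask)


target = [0.2, 0.8, 0.5, 0.1]
synth = [0.3, 0.8, 0.1, 0.1]
lam = 0.2

# An all-zero mask explains nothing: zero reconstruction error, full penalty.
all_zero = objective(target, synth, [0.0] * 4, lam)
# An all-one mask pays the true reconstruction error and no penalty.
all_one = objective(target, synth, [1.0] * 4, lam)
```

Without the penalty the all-zero mask would always win with an error of 0; with it, the all-zero mask costs about 0.8 against about 0.5 for the all-one mask in this toy case.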


Although this is the main objective, Zhou et al. zhou2017unsupervised mention a few details that improve performance: (1) a smoothness loss (TV-norm), which penalizes the $L_1$ norm of the second-order gradients of the depth maps $D$; (2) repeating the objective for multiple scales of the input image; (3) using several source images to predict their poses jointly, instead of one at a time.
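As an illustration of the smoothness term in item (1), the sketch below sums the absolute second-order finite differences of a depth map along both image axes. The exact discretization used in the cited work is not given here, so the central-difference stencil and the omission of mixed derivatives are assumptions.

```python
def second_order_smoothness(depth):
    """L1 norm of second-order finite differences of a 2D depth map."""
    h, w = len(depth), len(depth[0])
    total = 0.0
    # Horizontal second differences: d[j-1] - 2 d[j] + d[j+1].
    for i in range(h):
        for j in range(1, w - 1):
            total += abs(depth[i][j - 1] - 2.0 * depth[i][j] + depth[i][j + 1])
    # Vertical second differences.
    for i in range(1, h - 1):
        for j in range(w):
            total += abs(depth[i - 1][j] - 2.0 * depth[i][j] + depth[i + 1][j])
    return total


# A planar ramp has zero second derivative, so it is not penalized.
ramp = [[float(i + 2 * j) for j in range(4)] for i in range(4)]
# A depth discontinuity is penalized on both sides of the step.
step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]

ramp_penalty = second_order_smoothness(ramp)
step_penalty = second_order_smoothness(step)
```

Penalizing second-order rather than first-order differences leaves slanted planes such as roads and walls free of cost, which suits typical driving and indoor scenes.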

4 Proposed method

As we mentioned in sec. 1, the main limitation of the method from sec. 3 is that it assumes a rigid world. Our main goal then is to augment it to also account for freely-moving, potentially non-rigid objects. Note that since most pixels in common scenes correspond to a static background, it is undesirable to have a fully non-rigid model, as that would afford too many degrees-of-freedom and thus be prone to overfitting (a hypothesis that we verify in sec. 5). For this reason, we segment the pixels into 2 categories: background (with a global rigid model), and objects (with a local, non-rigid model). Together, they fully explain all pixels of the image. The required segmentation network is trained unsupervised as part of the overall objective, and so object segmentation is obtained “for free”. We will now describe these three elements: non-rigid and rigid models, and segmentation.

Figure 1: Overview. Top panel: From a pair of video frames, CNNs predict 3 maps: depth, local 6D motion, and foreground segmentation. Higher depths are darker; the axes of motion translation and rotation (XYZ) are encoded as RGB channels; and foreground segmentation is white. Bottom-left: We model background pixels as a rigid object, with global 6D motion obtained by averaging their predicted motions. An L1 reconstruction error w.r.t. the target image allows learning. Bottom-right: Foreground pixels are treated similarly, but are only locally rigid; inputs are passed through a tiling operator, dividing them into small patches. This allows motions per object/patch, unlike the fully rigid model.

4.1 Locally-rigid scene model

The starting point for our method is the simple fact that, although a scene is not globally rigid, it usually is locally rigid. It is always possible to make the rigid-world assumption essentially correct, by narrowing down the view to a rigid region (e.g. background, a rigid object, or a smaller region within a non-rigid object). This seems to suggest that the previous method (sec. 3) can express a non-rigid model, by focusing on smaller regions of the image at a time instead of full images. We can then predict a different relative pose $P_r$ for each region $r$, instead of a single pose $P$. We thus propose to average the objective (eq. 3) over a set of regions $\Omega$, extracted with a sliding window:

$$\mathcal{L}_k = \frac{1}{|\mathcal{S}|\,|\Omega|} \sum_{(I_s, I_t) \in \mathcal{S}} \sum_{r \in \Omega} \lVert I_t - \phi(I_s, D, P_r) \rVert_{M \odot \mathbb{1}_r}, \tag{6}$$

where we element-wise multiply ($\odot$) the mask $M$ with an indicator function $\mathbb{1}_r$, which is 1 inside the region $r$ and 0 outside of it. $P_r$ denotes the corresponding pose, extracted from a pose map $\mathbf{P}$ at the center of the region. The dense map of pose predictions $\mathbf{P}$ fits naturally into the CNN-based architecture, in the same way as the depth predictions $D$.

While it achieves our goal of a locally-rigid model, eq. 6 is very inefficient (by summing over many zeros), and an efficient implementation would require customizing the differentiable warp operator $\phi$. However, we can ensure an easy implementation with no modification to $\phi$ by using an operator that extracts patches in a sliding window, and concatenates them as samples before they are passed on to $\phi$. This tiling operator, also known as im2row, is used in some implementations of convolutions (see footnote 2). We can write it succinctly, for an input $x$ with $c$ channels and spatial dimensions $h \times w$, as $\mathcal{T}_k(x)$:

$$[\mathcal{T}_k(x)]_{ij} = x[\,:,\; s i : s i + k,\; s j : s j + k\,], \tag{7}$$

where $[\cdot]$ is the tensor slice operator, the patches are $k \times k$, and the stride is $s$. Eq. 6 then becomes:

$$\mathcal{L}_k = \frac{1}{|\mathcal{S}|\,|\Omega|} \sum_{(I_s, I_t) \in \mathcal{S}} \lVert \mathcal{T}_k(I_t) - \phi(\mathcal{T}_k(I_s), \mathcal{T}_k(D), \operatorname{vec}(\mathbf{P})) \rVert_{\mathcal{T}_k(M)}, \tag{8}$$

which effectively amounts to applying the tiling operator to all image-sized inputs of the original rigid-model objective (eq. 3), and vectorizing the pose map $\mathbf{P}$ so that the spatial dimensions correspond to the batch dimension. Since the kernel size $k$ is unknown, in practice we repeat the objective for several values of $k$, corresponding to different object sizes. Our proposed inclusion of the tiling operators is illustrated in fig. 1.

Footnote 2: One example of its use for convolutions is in the Caffe deep learning library. In PyTorch, for example, it is implemented as the unfold operation.
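The sliding-window patch extraction described above can be sketched in a few lines. This version works on nested lists (channels × height × width) to stay dependency-free, and it assumes no padding; the function name `tile` and the row-major patch ordering are illustrative choices, not the paper's implementation, which reuses a library operator.

```python
def tile(x, k, s):
    """Extract k x k patches with stride s from x (channels x height x width).

    Returns a list of patches, each channels x k x k, in row-major order,
    so that the patch index can play the role of a batch dimension.
    """
    h, w = len(x[0]), len(x[0][0])
    patches = []
    for top in range(0, h - k + 1, s):
        for left in range(0, w - k + 1, s):
            patch = [[row[left:left + k] for row in channel[top:top + k]]
                     for channel in x]
            patches.append(patch)
    return patches


# One-channel 4x4 image holding the values 0..15.
image = [[[4 * i + j for j in range(4)] for i in range(4)]]

non_overlapping = tile(image, k=2, s=2)   # 2 x 2 grid of patches
overlapping = tile(image, k=2, s=1)       # 3 x 3 grid of patches
```

Because every patch becomes an independent sample, an unmodified warp can be applied to each one with its own pose, which is what makes the per-region objective cheap to implement.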

4.2 Object segmentation

As discussed in sec. 4, a fully non-rigid model contains too many degrees-of-freedom. Another issue is that regions defined by square sliding windows (eq. 6) are too coarse to accurately delineate objects’ boundaries. We propose to solve both issues at once by using the predicted mask $M$ to partition the pixels into moving objects ($M = 1$) and static background ($M = 0$):

$$\mathcal{L}' = \frac{1}{|\mathcal{S}|\,|\Omega|} \sum_{(I_s, I_t) \in \mathcal{S}} \lVert \mathcal{T}_k(I_t) - \phi(\mathcal{T}_k(I_s), \mathcal{T}_k(D), \operatorname{vec}(\mathbf{P})) \rVert_{\mathcal{T}_k(M)} + \frac{1}{|\mathcal{S}|} \sum_{(I_s, I_t) \in \mathcal{S}} \lVert I_t - \phi(I_s, D, \bar{P}) \rVert_{1 - M}, \tag{9}$$

where $\bar{P}$ is the background pose, obtained by averaging the pose of all background pixels. The first term is the tiled, locally-rigid objective (eq. 8) weighted by the object mask, and the second is the rigid objective (eq. 3) weighted by the background mask. Unfortunately, since the non-rigid model is more expressive than the rigid one, assigning all pixels to the former will always attain a lower error. To prevent this trivial solution, we modify the regularization term (eq. 5) to encourage a constant area of foreground pixels in a training batch. We consider that the top 10% of the predictions correspond to pixels with moving objects ($M = 1$), and the rest to background ($M = 0$), penalizing the distance to these target values:

$$\mathcal{R}' = \lVert \operatorname{sort}(M) - t \rVert, \tag{10}$$

where sort operates in descending order, and $t$ is a vector where the first 10% of the elements are 1 and the rest are 0. This “constant area” soft constraint is inspired by a similar approach used for visualizing salient regions in deep network interpretability fong2019understanding , and we found it to be an effective strategy to ensure a correct proportion of object and background pixels. Our overall objective is then $\mathcal{L}' + \lambda \mathcal{R}'$, in analogy to sec. 3. An illustration of the objective is shown in fig. 1.
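The “constant area” constraint can be illustrated with a short sketch: sort the mask values in descending order and compare them with a target vector whose first 10% of entries are 1. The L1 distance, the division by the number of pixels, and the function name are assumptions made for this example.

```python
def constant_area_regularizer(mask_values, foreground_fraction=0.1):
    """Penalize deviation of the sorted mask from a fixed-area binary target."""
    ranked = sorted(mask_values, reverse=True)          # descending order
    n_fg = round(foreground_fraction * len(ranked))     # size of the 1-block
    target = [1.0] * n_fg + [0.0] * (len(ranked) - n_fg)
    return sum(abs(m - t) for m, t in zip(ranked, target)) / len(ranked)


# Ten pixels, so the target area is exactly one foreground pixel.
ideal = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
all_foreground = [1.0] * 10
all_background = [0.0] * 10

penalty_ideal = constant_area_regularizer(ideal)
penalty_all_fg = constant_area_regularizer(all_foreground)
penalty_all_bg = constant_area_regularizer(all_background)
```

Sorting makes the penalty depend only on how many pixels are assigned to the foreground, not on where they are, so the reconstruction terms remain free to decide which pixels belong to moving objects.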

5 Experiments

We conducted a series of experiments to validate the effectiveness of our approach. The tasks we focused on were visual odometry, monocular depth estimation, and 6D motion segmentation.


We evaluate the proposed approach on two different large-scale datasets. The first one is the challenging autonomous driving benchmark KITTI Geiger2013IJRR . For depth estimation, we use the training split defined by Eigen et al. eigen2014depth . For evaluating the performance of the visual odometry, we use the KITTI Odometry dataset, training on sequences 00-08 and testing on sequences 09-10. For the second dataset, we used EPIC-Kitchens Damen2020Collection , which is collected under various indoor kitchen scenarios, and is the largest dataset for egocentric vision. It contains 32 kitchens across 4 cities, totalling 55 hours of video. It captures rich non-rigid dynamic motions. Some examples are shown in Fig. 4. As the original frame rate of the videos is 60 FPS, to reduce the redundancy, we sample every 4th frame, resulting in a dataset of around 120k images. Among them, 100k images are used for training, and the rest for testing. The dataset does not provide ground-truth depth, camera poses or intrinsics. We learn to recover all of them by unsupervised end-to-end learning.

Training setup.

Our training procedure is exactly the same as for the SfMLearner zhou2017unsupervised , except with the non-rigid model we propose (sec. 4). We use a ResNet-50 as the backbone CNN, and apply the same TV-norm (sec. 3) to both depth ($D$) and pose ($\mathbf{P}$) maps. The only other improvements are depth mean normalization and backbone initialization according to Gordon et al. gordon2019depth , as well as disabling the multi-scale prediction, which does not seem beneficial. For brevity, we do not describe the full setup here, but a self-contained description can be found in the supp. material (Appendix A).

Figure 2: Visualisation of depth estimates for our method, including an ablation without the non-rigid component, and two other methods, including one with access to stereo information godard2017unsupervised . Our method seems more accurate, capturing moving cars and fine details such as traffic signs and poles.
| Method | Sequence 09 (ATE) | Sequence 10 (ATE) |
|---|---|---|
| Mean Odometry | 0.032 ± 0.026 | 0.028 ± 0.023 |
| ORB-SLAM (short) | 0.064 ± 0.141 | 0.064 ± 0.130 |
| ORB-SLAM (full) | 0.014 ± 0.008 | 0.012 ± 0.011 |
| Zhou et al. zhou2017unsupervised | 0.021 ± 0.017 | 0.020 ± 0.015 |
| DF-Net zou2018df | 0.017 ± 0.007 | 0.015 ± 0.009 |
| Monodepth2 gordon2019depth | 0.017 ± 0.008 | 0.015 ± 0.010 |
| Zhou et al. zhou2017unsupervised (new) | 0.016 ± 0.009 | 0.013 ± 0.009 |
| Bian et al. bian2019unsupervised | 0.016 ± 0.007 | 0.015 ± 0.015 |
| Klodt et al. klodt2018supervising | 0.014 ± 0.007 | 0.013 ± 0.009 |
| Mahjourian et al. mahjourian2018unsupervised | 0.013 ± 0.010 | 0.012 ± 0.011 |
| EPC++ luo2018every | 0.013 ± 0.007 | 0.012 ± 0.008 |
| GeoNet yin2018geonet | 0.012 ± 0.007 | 0.012 ± 0.009 |
| CC ranjan2019competitive | 0.012 ± 0.007 | 0.012 ± 0.009 |
| Ours | 0.011 ± 0.005 | 0.010 ± 0.007 |

Table 1: Absolute Trajectory Error (ATE, mean ± std. dev.) on the KITTI odometry test split averaged over all 5-frame snippets. Our approach outperforms all others, including a traditional SLAM pipeline.
Figure 3: Qualitative state-of-the-art comparison of the visual odometry results on the full testing sequences of the KITTI odometry dataset: (a) testing sequence 09; (b) testing sequence 10.

Pose estimation performance.

Our sequential pose estimation (visual odometry) performance is reported in table 1. We compare our method with a traditional monocular SLAM system, ORB-SLAM (full) mur2015orb , as well as its local version, ORB-SLAM (short), which uses 5-frame snippets. We also compare with the SfMLearner zhou2017unsupervised and several recent proposals zou2018df ; mahjourian2018unsupervised ; yin2018geonet ; bian2019unsupervised ; klodt2018supervising ; ranjan2019competitive . As can be seen in table 1, our method outperforms all the other methods. This includes a traditional SLAM pipeline that draws from many years of careful engineering and manual tuning (ORB-SLAM). It is worth mentioning that our method is not explicitly trained for visual odometry, yet it is a useful by-product of training. Regarding the deep learning based approaches, our camera motion estimator outperforms the original SfMLearner zhou2017unsupervised by a large margin, and recent competing approaches by narrower but still significant margins. We include the standard deviations over several runs in table 1 for additional context.

Qualitative results on visual odometry.

We visualize the predicted camera trajectories in fig. 3a and 3b, for several algorithms, on two KITTI test sequences (09 and 10, respectively). All trajectories are registered w.r.t. the ground truth as standard zhan2018unsupervised . It is apparent that the trajectory predicted by our method very accurately follows the ground truth. While other methods can also get close to the ground truth trajectory (SC-SFM in fig. 3a and ORB-SLAM in 3b), no other can achieve the same performance simultaneously on both settings.

Depth estimation performance.

We report the result for depth estimation in table 2. The columns are relative error and its square, root-mean-squared error (RMSE) and its logarithm, as well as the accuracy at 3 given depth thresholds. We include several state-of-the-art approaches in the comparison, including supervised methods eigen2014depth ; liu2015deep and unsupervised stereo-based methods godard2017unsupervised ; garg2016unsupervised , which are not comparable but present an informative upper bound on performance. We do not include in the comparison works with significantly different protocols, such as 3DPackNet guizilini2019packnet which uses higher-resolution images and 3D CNNs, and other works with higher amounts of supervision. These interesting developments are complementary to ours, and thus outside of the scope of this paper. All methods are compared with raw ground truth depth, following recent protocols gordon2019depth ; guizilini2019packnet . We can observe that our method achieves the best performance out of all unsupervised methods, even beating several supervised ones, on almost all metrics. We show two variants of our method, using ResNet-18 and ResNet-50 backbone CNNs, which highlights that the performance gains are not solely due to an increase in capacity, since most methods have comparable backbones. On the other hand, it also reveals that our model is expressive enough to afford some gains when increasing the capacity to ResNet-50. More importantly, our method achieves clearly better results compared with two recent works, i.e. Struct2Depth casser2019depth and Tosi et al. tosi2020distilled , which use explicit semantic labels to guide the learning of object motion, demonstrating the benefits of unsupervised segmentation.
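For readers unfamiliar with these columns, the sketch below computes them for a handful of pixels. It assumes the definitions commonly used on the Eigen et al. KITTI split, with accuracy thresholds of 1.25, 1.25² and 1.25³ on the ratio max(pred/gt, gt/pred); the function and key names are illustrative.

```python
import math


def depth_metrics(pred, gt):
    """Standard monocular depth error and accuracy measures (assumed forms)."""
    n = len(gt)
    pairs = list(zip(pred, gt))
    ratios = [max(p / g, g / p) for p, g in pairs]
    return {
        "rel": sum(abs(p - g) / g for p, g in pairs) / n,
        "sq_rel": sum((p - g) ** 2 / g for p, g in pairs) / n,
        "rmse": math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n),
        "rmse_log": math.sqrt(
            sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n),
        # Fraction of pixels whose ratio is below each threshold 1.25^e.
        "acc": [sum(r < 1.25 ** e for r in ratios) / n for e in (1, 2, 3)],
    }


gt = [10.0, 20.0, 40.0, 5.0]        # ground-truth depths in metres
pred = [11.0, 20.0, 30.0, 10.0]     # hypothetical predictions

metrics = depth_metrics(pred, gt)
perfect = depth_metrics(gt, gt)
```

In this toy example the accuracy rises from 0.5 to 0.75 as the threshold loosens, while the badly underestimated near point (5 m predicted as 10 m) dominates the relative error.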

Figure 4: Qualitative examples of our unsupervised depth estimation on the EPIC-Kitchens dataset. Despite the fast, non-rigid motions, we can recover detailed structures, such as the tabletop objects.
Error columns (rel, sq rel, rmse, rmse (log)): lower is better. Accuracy columns (δ thresholds): higher is better.

| Method | Setting | rel | sq rel | rmse | rmse (log) | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|---|
| Eigen et al. eigen2014depth | M + D | 0.203 | 1.548 | 6.307 | 0.282 | 0.702 | 0.890 | 0.958 |
| Liu et al. liu2015deep | M + D | 0.202 | 1.614 | 6.523 | 0.275 | 0.678 | 0.895 | 0.965 |
| AdaDepth adadepth | S | 0.203 | 1.734 | 6.251 | 0.284 | 0.687 | 0.899 | 0.958 |
| Garg et al. garg2016unsupervised | S | 0.169 | 1.080 | 5.104 | 0.273 | 0.740 | 0.904 | 0.962 |
| Zhan et al. zhan2018unsupervised | S | 0.144 | 1.391 | 5.869 | 0.241 | 0.803 | 0.933 | 0.971 |
| MS-CRF xu2018monocular | M + D | 0.125 | 0.899 | 4.685 | - | 0.816 | 0.951 | 0.983 |
| Godard et al. godard2017unsupervised | S | 0.124 | 1.076 | 5.311 | 0.219 | 0.847 | 0.942 | 0.973 |
| Kuznietsov et al. kuznietsov2017semi | S + D | 0.113 | 0.741 | 4.621 | 0.189 | 0.862 | 0.960 | 0.986 |
| Zhou et al. zhou2017unsupervised | M + V | 0.208 | 1.768 | 6.858 | 0.283 | 0.678 | 0.885 | 0.957 |
| Yang et al. yang2018unsupervised | M + V | 0.182 | 1.481 | 6.501 | 0.267 | 0.725 | 0.906 | 0.963 |
| Mahjourian mahjourian2018unsupervised | M + V | 0.163 | 1.240 | 6.220 | 0.250 | 0.762 | 0.916 | 0.968 |
| Geonet (ResNet) yin2018geonet | M + V | 0.155 | 1.296 | 5.857 | 0.233 | 0.793 | 0.931 | 0.973 |
| Wang et al. wang2017learning | M + V | 0.151 | 1.257 | 5.583 | 0.228 | 0.810 | 0.936 | 0.974 |
| DF-Net zou2018df | M + V | 0.150 | 1.124 | 5.507 | 0.223 | 0.806 | 0.933 | 0.973 |
| Struct2Depth casser2019depth | M + V | 0.141 | 1.026 | 5.291 | 0.215 | 0.816 | 0.945 | 0.979 |
| CC (ResNet) ranjan2019competitive | M + V | 0.140 | 1.070 | 5.326 | 0.217 | 0.826 | 0.941 | 0.975 |
| Bian et al. bian2019unsupervised | M + V | 0.128 | 1.047 | 5.234 | 0.208 | 0.846 | 0.947 | 0.976 |
| Gordon et al. gordon2019depth | M + V | 0.128 | 0.959 | 5.230 | - | - | - | - |
| Tosi et al. tosi2020distilled | M + V | 0.125 | 0.805 | 4.795 | 0.195 | 0.849 | 0.955 | 0.983 |
| Monodepth2 gordon2019depth | M + V | 0.115 | 0.882 | 4.701 | 0.190 | 0.879 | 0.961 | 0.982 |
| Ours (ResNet-18) | M + V | 0.105 | 0.889 | 4.780 | 0.182 | 0.884 | 0.961 | 0.982 |
| Ours (ResNet-50) | M + V | 0.103 | 0.881 | 4.763 | 0.179 | 0.889 | 0.964 | 0.984 |

Table 2: Quantitative comparison of depth estimation performance among several methods from the literature, on the KITTI raw dataset (Eigen et al. eigen2014depth testing split). We show some supervised (denoted ‘D’) and stereo (‘S’) methods for reference, but a fair comparison is only with monocular methods (‘M’) trained with video (‘V’). Our method outperforms others in most metrics, even when using a ResNet-18 backbone with far fewer parameters.

Qualitative results on depth.

Fig. 2 shows a direct comparison of the depth estimates from our method, the most comparable baseline by Zhou et al. zhou2017unsupervised , and a stereo-based method garg2016unsupervised . Our method recovers much finer details, compared to a rigid-world model (4th column). It is also interesting to compare the level of detail with the stereo method (3rd column), which achieves better error metrics (table 2), but has relatively inconsistent fine details. Ours seems to strike a good balance between capturing the high-level layout and small-scale features. We also show qualitative results of our method in EPIC-Kitchens, for which there is no ground truth, in fig. 4. Despite the challenges of the quick camera and object motions in this setting, the quality of the recovered depths is apparent, making it possible to identify individual small objects and fine geometry.

Figure 5: Predicted 3D mesh and segmentations on the EPIC-Kitchens test set. Contours of constant height are shown as lines (higher values are brighter). Note the correct geometry of the sink (left panel) and bottle cap (right panel). A failure mode is also visible on the right panel (back wall is distorted in the top-right). Object masks are tinted red; normally corresponding to moving hands (also higher/nearer to the camera) or manipulated objects.
| Method | rel | sq rel | rmse |
|---|---|---|---|
| SfMLearner zhou2017unsupervised | 0.208 | 1.768 | 6.858 |
| SfMLearner + our non-rigid model | 0.173 | 1.302 | 6.252 |
| + regularization (TV-norm) | 0.165 | 1.301 | 6.225 |
| Our rigid model (eq. 3) | 0.121 | 1.115 | 5.285 |
| Our non-rigid model (eq. 8) | 0.108 | 0.899 | 4.788 |
| + regularization (TV-norm) | 0.107 | 0.895 | 4.786 |
| + segmentation (eq. 9-10) (full system) | 0.105 | 0.889 | 4.780 |

Table 3: Ablation study on monocular depth estimation (error). See sec. 5 for details.
| Method | seq. 09 (ATE) | seq. 10 (ATE) |
|---|---|---|
| SfMLearner zhou2017unsupervised | 0.021 ± 0.017 | 0.020 ± 0.015 |
| SfMLearner + our non-rigid model | 0.012 ± 0.007 | 0.011 ± 0.008 |
| + regularization (TV-norm) | 0.011 ± 0.005 | 0.011 ± 0.007 |
| Our rigid model (eq. 3) | 0.016 ± 0.012 | 0.015 ± 0.013 |
| Our non-rigid model (eq. 8) | 0.013 ± 0.006 | 0.011 ± 0.010 |
| + regularization (TV-norm) | 0.012 ± 0.006 | 0.011 ± 0.009 |
| + segmentation (eq. 9-10) (full system) | 0.011 ± 0.005 | 0.010 ± 0.007 |

Table 4: Ablation study on visual odometry (Absolute Trajectory Error and std. dev.). See sec. 5 for details.

Qualitative results on object discovery and motion prediction.

Since our proposed method can predict relative poses (6D motion) densely for the whole image and separate foreground pixels (with independent motions) from static background pixels (with a single coherent motion), it should be able to discover moving objects in the image automatically. We validate this idea by binarizing the mask $M$ (with a threshold of 0.7), and overlaying it on the original data. The resulting segmentations in EPIC-Kitchens can be visualized in fig. 5, along with the projected 3D geometry. We can observe the correct segmentation and clustering of the hands, and of objects as they are being held. We show more visualizations of the motion prediction in the suppl. material (Appendix B).

Ablative study.

Although our model adds relatively few elements to view synthesis, it is important to know their relative impact on performance. We show the performance of our system with the addition of each element in tables 3-4. It is apparent that the biggest boost comes from adding our non-rigid model, especially in visual odometry performance. We also show a similar breakdown, but adding our improvements to the original SfMLearner system zhou2017unsupervised , which has a weaker backbone CNN and no depth normalization (see Appendix A for details). This shows that the benefits of our non-rigid model are complementary to other improvements, yet very significant.

6 Conclusion

We have proposed a simple approach to view synthesis for self-supervised learning, which takes a successful rigid-world model and augments it with a locally-rigid one. This is done by strategic placement of tiling operators within the network, which is efficient and produces highly consistent depth and pose estimates. Our change enables unsupervised object discovery by motion segmentation, and allows associating 3D pose and 6D motion to different objects in the scene. We also demonstrated a new state-of-the-art in unsupervised visual odometry and monocular depth estimation over comparable baselines. We hope that our contribution enables a new class of non-rigid SLAM algorithms that continue the trend towards truly holistic scene understanding.

7 Broader impact

Our contributions improve the automatic understanding of 3D geometry, including camera and object motions through space. The most widespread applications of this technology nowadays are self-driving vehicles and robots meant to navigate human spaces (including indoors), which have a positive impact for society.

Given that the kind of perception we study has little semantic content, focusing instead on geometry, there is limited potential for harmful biases. It is conceivable that machines intended to cause harm on purpose could be built using computer vision technology, and that 3D environment perception would be one small part of such devices. This is a problem that we strongly believe that society as a whole needs to work through with appropriate regulations, and that in the case of 3D perception, the benefits (e.g. self-driving vehicle technology) outweigh the avoidable risks.

As for unintentional failure cases, we must point out that monocular depth estimation systems are inherently dependent on past experience to make accurate predictions in a given environment. As such, it is important that safety-critical systems (e.g. collision avoidance) do not depend solely on monocular estimates of depth since they may fail in unfamiliar situations; redundant systems (e.g. stereo, other sensors) are necessary.

