Embodied View-Contrastive 3D Feature Learning

by Adam W. Harley et al.
Carnegie Mellon University

Humans can effortlessly imagine the occluded side of objects in a photograph. We do not simply see the photograph as a flat 2D surface; we perceive the 3D visual world captured in it by using our imagination to inpaint the information lost during camera projection. We propose neural architectures that similarly learn to approximately imagine abstractions of the 3D world depicted in 2D images. We show that this capability suffices to localize moving objects in 3D, without using any human annotations. Our models are recurrent neural networks that consume RGB or RGB-D videos, and learn to predict novel views of the scene from queried camera viewpoints. They are equipped with a 3D representation bottleneck that learns an egomotion-stabilized and geometrically consistent deep feature map of the 3D world scene. They estimate camera motion from frame to frame, and cancel it from the extracted 2D features before fusing them in the latent 3D map. We handle multimodality and stochasticity in prediction using ranking-based contrastive losses, and show that they can scale to photorealistic imagery, in contrast to regression or VAE alternatives. Our model proposes 3D boxes for moving objects by estimating a 3D motion flow field between its temporally consecutive 3D imaginations, and thresholding motion magnitude: camera motion has been cancelled in the latent 3D space, and thus any non-zero motion is an indication of an independently moving object. Our work underlines the importance of 3D representations and egomotion stabilization for visual recognition, and proposes a viable computational model for learning 3D visual feature representations and 3D object bounding boxes supervised by moving and watching objects move.








1 Introduction

Human infants quickly develop the ability to fixate on moving objects spelke1982perceptual and perceive them as solid 3D shapes soska2008development . Though parents often supply object category labels, e.g., “it is not a dog, it is a cat”, no 3D bounding box or 3D segmentation mask is supplied to aid the development of human object detectors, neither across our lifespans nor across our evolutionary history. Yet, we all perceive the chair in front of our desk as occupying 3D space as opposed to being a 2D surface of floating pixels. This is remarkable considering that our visual sensors are 2.5D at best in each time step (with the exception of stereoblindness doi1011772041669517738542 ).

Inspired by the human ability for 3D perception and imagination, a long-standing debate in Computer Vision is whether it is worth pursuing 3D models in the form of binary voxel grids, meshes, or 3D pointclouds as the output of visual recognition. The “blocks world” of Larry Roberts in 1967 blocksworld set as its goal to reconstruct the 3D scene depicted in the image in terms of 3D solids found in a database. Pointing out that replicating the 3D world in one’s head is not enough to actually make decisions, Rodney Brooks argued for feature-based representations brooks1991intelligence , as done by recent work in end-to-end deep learning DBLP:journals/corr/LevineFDA15 . Whether explicit 3D representations are available in human cortex has not so far found a definite answer 10.1093/cercor/bhv229 . To 3D, or not to 3D, that is the question atkeson .

Our work answers this question in the affirmative, and proposes learning-based 3D feature representations in place of previous human-engineered ones, such as 3D pointclouds, meshes, or voxel occupancies. We propose recurrent neural networks equipped with a 3D representation bottleneck as their latent state, as shown in Figure 1. We train these networks to predict 2D and 3D abstractions of the world scene from various camera viewpoints, emulating the supervision available to an embodied mobile agent. We show that their latent 3D space learns to form plausible 3D imaginations of the scene given 2D or 2.5D video streams as input: it inpaints visual information behind occlusions, infers 3D extent of objects, and correctly predicts object occlusions and dis-occlusions. At their core, the proposed models estimate egomotion in each step and stabilize the extracted deep features before integrating them into the geometrically consistent 3D feature map. This is similar to what Simultaneous Localization and Mapping (SLAM) mur2017orb methods do when accumulating depth maps into a 3D scene pointcloud. Our models, which we call 3D-bottlenecked RNNs, use trainable SLAM-like modules with 2D and 3D deep features in place of depth maps or 3D pointclouds.

We handle multimodality and stochasticity during prediction with view contrastive losses in place of RGB regression: at each timestep, we project the accumulated-thus-far 3D feature map to a desired viewpoint, and optimize for matching the projected features against 2D or 3D features extracted bottom-up from the view under consideration. We show the proposed contrastive loss outperforms RGB view regression in semi-supervised learning of 3D object detectors, as well as in estimating 2D and 3D visual correspondences. Our models are trained in an end-to-end differentiable manner by backpropagating prediction errors.

We show that estimating dense 3D motion from frame to frame in the latent 3D imagination space allows label-free discovery of moving objects: any non-zero motion found while aligning consecutive 3D imaginations is an indication of independently moving entities, since camera motion has been cancelled. We show the proposed 3D object discovery outperforms alternatives of recent works Kosiorek:2018:SAI:3327757.3327951 ; NIPS2018_7333 that operate in 2D using synthetic moving MNIST or background-subtracted static-camera 2D videos, as well as 2.5D motion segmentation baselines.

Our models build upon the work of Tung et al. commonsense , which also considered RNNs with a 3D representation bottleneck and egomotion-stabilized updates, similar to ours but trained for view regression. Their model was tested in very simplistic simulation environments, using two-degree-of-freedom camera motion and only static scenes (i.e., without moving objects). We compare against their model in our experimental section and show that features learnt under the proposed view-contrastive losses are more semantically meaningful.

Summarizing, we have the following contributions over prior works: (1) We propose novel view contrastive losses and show that they can scale to photorealistic scenes and learn semantic features for semi-supervised learning of 3D object detectors and 2D/3D visual correspondence, outperforming pixel regression Dosovitskiy17 ; commonsense and VAE Eslami1204 alternatives. (2) We propose a novel 3D moving object segmentation method by estimating and thresholding 3D motion in a latent egomotion-stabilized 3D feature space; we show it outperforms 2.5D baselines and iterative generative what-where VAEs Kosiorek:2018:SAI:3327757.3327951 ; NIPS2018_7333 of previous works.

Our model attempts to provide a scalable computational model of predictive coding rao1999predictive ; friston2003learning and mapping cogmap ; egobrain theories from cognitive neuroscience, by predicting future observations in 2D and 3D abstract feature space, while simultaneously building an internal geometrically consistent 3D map of the world scene. Our code and data will be made publicly available upon publication.

2 Related Work

Predictive visual feature learning. Predictive coding theories rao1999predictive ; friston2003learning suggest that the brain predicts observations at various levels of abstraction. These theories currently have extensive empirical support: stimuli are processed more quickly if they are predictable mcclelland1981interactive ; pinto2015expectations , prediction error is reflected in increased neural activity rao1999predictive ; brodski2015faces , and disproven expectations lead to learning schultz1997neural . Recent work in unsupervised learning has successfully used these ideas to learn word representations by predicting neighboring words. Many challenges emerge going from a finite word vocabulary to the continuous high dimensional image data manifold. Unimodal losses such as mean squared error are not very useful when predicting high dimensional data, due to the stochasticity of the output space. Researchers have tried to handle such stochasticity using latent variable models loehlin1987latent or autoregressive prediction of the output pixel space, which involves sampling each pixel value from a categorical distribution van2016conditional conditioned on the output thus far. Another option is to make predictions in a latent feature space. Recently, the work of Oord et al. oord2018representation followed this direction and used an objective that preserves mutual information between the future bottom-up extracted features and the predicted contextual latent features; they applied it to speech, text, and image patches in static images. The view-contrastive loss proposed in this work is a non-probabilistic version of their contrastive objective. However, our work focuses on the video domain as opposed to static image patches, and uses drastically different architectures for both the contextual and bottom-up representations, going through a 3-dimensional representation bottleneck, in contrast to oord2018representation . Alongside future image prediction rao1999predictive ; Eslami1204 ; DBLP:journals/corr/TatarchenkoDB15 ; commonsense , predicting some form of contextual or missing information has also been explored, such as predicting frame ordering lee2017unsupervised , spatial context DBLP:journals/corr/DoerschGE15 ; pathakCVPR16context , color from grayscale vondrick2018tracking , egomotion DBLP:journals/corr/JayaramanG15 ; DBLP:journals/corr/AgrawalCM15 and future motion trajectories DBLP:journals/corr/WalkerDGH16 .

Deep geometry. Some recent works have attempted various forms of map-building gup ; henriques2018mapnet as a form of geometrically-consistent temporal integration of visual information, in place of geometry-unaware vanilla LSTM hochreiter1997long or GRU cho2014learning models. The closest to ours are Learnt Stereo Machines (LSMs) LSM , DeepVoxels sitzmann2018deepvoxels , and Geometry-aware RNNs (GRNNs) commonsense , which integrate images sampled from a viewing sphere into a latent 3D feature memory tensor, in an egomotion-stabilized manner, and predict views. All of these works consider very simplistic non-photorealistic environments. None of these prior works evaluates the suitability of their learnt features for a downstream task. Rather, accurately predicting views is their main objective.

Motion estimation and object segmentation. Most motion segmentation methods operate in 2D image space, and use 2D optical flow to segment moving objects. While earlier approaches attempted completely unsupervised motion segmentation by exploiting motion trajectories and integrating motion information over time springerlink:10.1007/978-3-642-15555-0_21 ; OB11 , recent works focus on learning to segment objects in videos, supervised by annotated video benchmarks Fragkiadaki_2015_CVPR ; DBLP:journals/corr/KhorevaBIBS17 . Our work differs in that we address object detection and segmentation in 3D as opposed to 2D, by estimating 3D motion of the “imagined” (complete) scene, as opposed to 2D motion of the observed scene.

3 Embodied view-contrastive 3D feature learning

We consider a mobile agent that can move its camera at will. In static scenes, our agent learns 3D visual feature representations by predicting 2D and 3D features of the view seen from a query camera viewpoint, and back-propagating the prediction errors (Section 3.1). In dynamic scenes, our agent learns to estimate dense 3D motion fields that bring its egomotion-stabilized 3D feature imaginations of consecutive timesteps into correspondence, and proposes 3D object segmentations by thresholding motion magnitude (Section 3.2). In both cases, learning relies solely on “labels” provided by the moving agent itself, which moves and watches objects move, without any human supervision.

Figure 1: View-contrastive feature learning with 3D-bottlenecked RNNs. Top: Learning visual feature representations by moving in static scenes. The 3D-bottlenecked RNNs learn to map 2.5D video streams to egomotion-stabilized 3D feature maps of the scene by optimizing for view-contrastive prediction. Bottom: Learning to segment 3D moving objects by watching them move. Non-zero 3D motion in the learnt stabilized 3D feature space reveals independently moving objects and their exact 3D extent, without any human annotations.

3.1 Learning 3D feature representations by moving

Our model’s architecture is illustrated in Figure 1 (top). It is an RNN with a 4D latent state, which has a spatial resolution (width, height, and depth) and a feature dimensionality (channels). The latent state aims to capture a geometrically consistent 3D deep feature map of the 3D world scene. We refer to the memory state as the model’s imagination to emphasize that most of the grid points in it will not be observed by any sensor, and so the feature content must be “imagined” by the model. Our model has a set of differentiable modules to go back and forth between 3D feature imagination space and 2D image space. It builds upon the recently proposed geometry-aware recurrent neural networks (GRNNs) of Tung et al. commonsense , which also have a 4D egomotion-stabilized latent space, and are trained for RGB view prediction. We will use the terms GRNNs and 3D-bottlenecked RNNs interchangeably.

We briefly describe each module for completeness, and refer the reader to our supplementary file for more details. In comparison to Tung et al. commonsense , (i) our egomotion module can handle general (as opposed to two-degrees-of-freedom) camera motion, and we show in Section 4 that it performs on par with geometric methods, and (ii) our 3D-to-2D projection module decodes the 3D map into 2D feature maps seen from the viewpoint under consideration, as opposed to an RGB image.

2D-to-3D unprojection. This module converts the input RGB image and depth map into a 4D tensor by filling the 3D imagination grid with samples from the 2D image pixel grid using perspective (un)projection. It also maps the depth map to a binary occupancy voxel grid by assigning each voxel a value of 1 or 0, depending on whether or not a depth point lands in the voxel.
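The occupancy half of this module can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, describes the grid with the hypothetical parameters `bounds_min`, `voxel_size`, and `grid_shape`, and stores occupied voxels as a set of indices instead of a dense binary tensor. Filling the grid with RGB samples is omitted.

```python
import math


def unproject(u, v, z, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth z to a camera-frame 3D point."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)


def occupancy_from_depth(depth, fx, fy, cx, cy, bounds_min, voxel_size, grid_shape):
    """Return the set of voxel indices that contain at least one depth point.

    depth is a list of rows of metric depths; non-positive values are
    treated as missing measurements (an assumption of this sketch).
    """
    occupied = set()
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            point = unproject(u, v, z, fx, fy, cx, cy)
            index = tuple(
                int(math.floor((point[k] - bounds_min[k]) / voxel_size))
                for k in range(3)
            )
            # Keep only points that land inside the grid bounds.
            if all(0 <= index[k] < grid_shape[k] for k in range(3)):
                occupied.add(index)
    return occupied


occupied = occupancy_from_depth(
    [[1.0, 1.0], [1.0, 0.0]],
    fx=1.0, fy=1.0, cx=0.5, cy=0.5,
    bounds_min=(-1.0, -1.0, 0.0), voxel_size=0.5, grid_shape=(4, 4, 4),
)
```

Every voxel index absent from the returned set corresponds to a 0 in the binary occupancy grid.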

Latent map update. This module aggregates egomotion-stabilized (registered) feature tensors into the memory tensor. We denote registered tensors with a subscript reg. We first pass the registered tensors through a series of 3D convolution layers, producing a 3D feature tensor for the timestep. On the first timestep, we initialize the memory with this feature tensor. On later timesteps, the memory update is computed using a running average operation.
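As a rough illustration of the running average, the sketch below updates a flattened memory tensor with an equal-weight cumulative mean. The equal weighting, the function name `update_memory`, and the flat-list representation are assumptions made for this example; the text above does not specify the exact averaging scheme.

```python
def update_memory(memory, features, step):
    """Cumulative-mean memory update.

    memory and features are flat lists of floats of equal length;
    step counts the observations seen so far, starting at 1.
    """
    if step == 1 or memory is None:
        # First timestep: the memory is initialized with the features.
        return list(features)
    return [m + (f - m) / step for m, f in zip(memory, features)]


memory = None
for step, features in enumerate([[2.0, 4.0], [4.0, 0.0], [0.0, 2.0]], start=1):
    memory = update_memory(memory, features, step)
```

After the three updates, `memory` holds the element-wise mean of the three feature lists.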

Egomotion estimation. This module computes the relative 3D rotation and translation between the current camera pose and the reference pose of the latent 3D map. Our egomotion module can handle general motion and uses spatial pyramids, incremental warping, and cost volume estimation via 6D cross-correlation, inspired by PWC-Net sun2018pwc , a state-of-the-art 2D optical flow method. For details, please refer to the supplementary file.

3D-to-2D projection. This module “renders” 2D feature maps given a desired viewpoint by projecting the 3D feature state. We first orient the state map by resampling the 3D feature map into a view-aligned version, and then map it to a set of 2D feature maps with 2D convolutions. These 2D feature maps are the final output of the 3D-to-2D process.

3.1.1 View-contrastive 3D predictive learning

Given a set of (random) input views, we train our model to predict feature abstractions of the view seen from a (random) desired viewpoint, as shown in Figure 1. We learn these abstractions end-to-end using view-contrastive metric learning. Specifically, we consider two representations for the target view: a context-based one and a bottom-up one. Note that the context-based representation has access to the target viewpoint (pose) but not to the target view itself, while the bottom-up representation is a function of the target view. The metric learning loss ties the context-based and bottom-up representations together.

We explore two variants for our contrastive predictive learning, one 2-dimensional and one 3-dimensional, as shown in Figure 1. These differ in the space where we compute the inner product between context-based and bottom-up representations. We obtain the 3D contextual tensor by orienting the 3D feature map built thus far to the query viewpoint. We obtain the 2D contextual tensor by projecting the oriented 3D feature map using the 3D-to-2D module. We obtain the 3D bottom-up tensor by feeding the target view to the 2D-to-3D unprojection module. We obtain the 2D bottom-up tensor by convolving the target view with a residual convolutional network.

The corresponding 2D and 3D contrastive losses are margin-based. Each involves a margin parameter, a learned variable controlling the position of the boundary, and an indicator of whether a given contextual feature corresponds to a given bottom-up feature. The contrastive losses pull corresponding (contextual and bottom-up) features close together in embedding space, and push non-corresponding ones beyond a margin of distance. It has been shown that the performance of a metric learning loss depends heavily on the sampling strategy used schroff2015facenet ; DBLP:journals/corr/SongXJS15 ; sohn2016improved . We use the distance-weighted sampling strategy proposed by Wu et al. wu2017sampling which uniformly samples “easy” and “hard” negatives wu2017sampling ; lehnensphere ; we find this outperforms both random sampling and semi-hard schroff2015facenet negative sampling.
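To make the roles of the margin, the boundary variable, and the correspondence indicator concrete, the sketch below evaluates a margin-based loss for a single pair of feature vectors. The functional form is an assumption of this example, patterned on the margin-based loss of Wu et al. wu2017sampling ; it is not taken from this text. The names `alpha`, `beta`, and `is_match` are hypothetical, and `beta` is a fixed number here instead of a learned variable.

```python
import math


def margin_contrastive_loss(context, bottom_up, is_match, alpha=0.1, beta=1.0):
    """Margin-based loss for one (contextual, bottom-up) feature pair.

    Assumed form: max(0, alpha + y * (d - beta)), where d is the Euclidean
    distance between the features and y is +1 for corresponding pairs and
    -1 for non-corresponding pairs.
    """
    dist = math.sqrt(sum((c - b) ** 2 for c, b in zip(context, bottom_up)))
    sign = 1.0 if is_match else -1.0
    return max(0.0, alpha + sign * (dist - beta))


# A corresponding pair that is already close incurs no loss ...
close_match = margin_contrastive_loss([0.0, 0.0], [0.0, 0.0], is_match=True)
# ... while a non-corresponding pair at the same distance is penalized.
close_mismatch = margin_contrastive_loss([0.0, 0.0], [0.0, 0.0], is_match=False)
```

Under this form, corresponding pairs are penalized only when their distance exceeds `beta - alpha`, and non-corresponding pairs only when it falls below `beta + alpha`.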

Intuitively, the proposed contrastive loss is a discriminative alternative to RGB pixel or deep feature L2 error. Our 3D metric learning loss asks 3D feature tensors depicting the same scene, but acquired from different viewpoints, to contain the same feature content. This 3D consistency is not necessarily obtained when training under a standard pixel regression loss. It is important for accurate dense 3D motion estimation, described next.

3.2 Learning to segment 3D objects by watching them move

Though the model is trained using sequences of frames as input, once trained it maps a single RGB-D image to a complete 3D imagination, as we show in Figure 1 (bottom). This is in contrast to SLAM methods, which do not learn or improve with experience, but always need to observe a scene from multiple viewpoints in order to reconstruct it. We freeze the imagination model’s weights, and use its features to estimate dense 3D motion in the latent 4D feature space.

Specifically, given two temporally consecutive (registered) 3D maps, predicted independently from the images and depth maps of the two timesteps, we train a 3D FlowNet to predict the dense 3D motion field between them. Since camera motion has been cancelled, this 3D motion field will only be non-zero for independently moving objects. Our 3D FlowNet works by iterating across scales in a coarse-to-fine manner. At each scale, we compute a 3D cost volume, convert these costs to 3D displacement vectors, and incrementally warp the two tensors to align them, essentially re-targeting a state-of-the-art optical flow method sun2018pwc to operate in a 3D instead of a 2D pixel grid. We train our 3D FlowNet with two forms of supervision. The first uses synthetic augmentations: we rigidly transform the first map and ask the FlowNet to recover the dense 3D flow field that corresponds to this transformation. The second is a warping error: we back-warp one map to align it with the other, and backpropagate the L1 norm of the difference. This extends self-supervised 2D flow works back_to_basics:2016 to 3D feature constancy (instead of 2D brightness constancy). We found that both the synthetic and warp-based supervision are essential for obtaining accurate 3D flow field estimates.
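The synthetic supervision signal can be illustrated with a short sketch that computes the ground-truth flow induced by a known rigid transformation. For brevity, this example assumes the rotation is restricted to the vertical (y) axis and operates on a list of 3D point coordinates; the function name `rigid_flow` and this parameterization are hypothetical and are not taken from the paper.

```python
import math


def rigid_flow(points, yaw, translation):
    """Flow vectors induced by rotating points about the y axis by yaw
    (radians) and then translating them."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    flow = []
    for x, y, z in points:
        new_x = c * x + s * z + tx
        new_y = y + ty
        new_z = -s * x + c * z + tz
        # The flow is the displacement from the old to the new position.
        flow.append((new_x - x, new_y - y, new_z - z))
    return flow


# A pure translation yields the same flow vector at every point.
shift_flow = rigid_flow([(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)], 0.0, (1.0, 0.0, 0.0))
# A quarter turn about the y axis moves the point (1, 0, 0) to (0, 0, -1).
turn_flow = rigid_flow([(1.0, 0.0, 0.0)], math.pi / 2, (0.0, 0.0, 0.0))
```

A flow network trained on such pairs sees the original map and its transformed copy as input, and the computed flow as the regression target.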

The computed 3D flow enjoys the benefits of permanent 3D feature content over time, and the lack of projective distortions. While 2D optical flow suffers from occlusions and dis-occlusions of image content Sun:CVPR:10 (for which 2D flow values are undefined), the proposed 3D imagination flow is always well-defined. Moreover, object motion in 3D does not suffer from projection artifacts, which turn rigid 3D transformations into non-rigid 2D flow fields. While 3D scene flow Hornacek_2014_CVPR concerns visible 3D points, 3D imagination flow is computed between visual features that may never have appeared in the field of view, but are rather inpainted by imagination. Since we are not interested in the 3D motion of empty air voxels, we additionally learn to estimate 3D voxel occupancy supervised by input depth maps, and set the 3D motion of all unoccupied voxels to zero. We describe our 3D occupancy learning in the supplementary file.

We obtain 3D object segmentation proposals by thresholding the 3D imagination flow magnitude, and clustering voxels using connected components. We score each component using a 3D version of a center-surround motion saliency score employed by numerous works for 2D motion saliency detection 19146246 ; Mahadevan10spatiotemporalsaliency . This score is high when the 3D box interior has lots of motion but the surrounding shell does not. This results in a set of scored 3D segmentation proposals for each video scene.
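The proposal step can be sketched as follows. This is a minimal illustration under several assumptions of this example: voxels are stored sparsely as a dictionary from index to flow magnitude, connectivity is 6-neighbor, and the center-surround score is the mean magnitude inside the component's bounding box minus the mean in a surrounding shell of width `pad`. The exact score used in the paper is not given in this text.

```python
from collections import deque

# 6-connected neighborhood offsets.
NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def segment_moving_voxels(flow_mag, threshold):
    """Threshold flow magnitudes and group moving voxels into connected components."""
    moving = {voxel for voxel, mag in flow_mag.items() if mag > threshold}
    components = []
    while moving:
        seed = moving.pop()
        component, queue = {seed}, deque([seed])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in NEIGHBORS:
                neighbor = (x + dx, y + dy, z + dz)
                if neighbor in moving:
                    moving.remove(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


def center_surround_score(flow_mag, component, pad=1):
    """Mean flow magnitude inside the bounding box minus the mean in its shell."""
    lo = [min(v[k] for v in component) for k in range(3)]
    hi = [max(v[k] for v in component) for k in range(3)]
    inside, shell = [], []
    for x in range(lo[0] - pad, hi[0] + pad + 1):
        for y in range(lo[1] - pad, hi[1] + pad + 1):
            for z in range(lo[2] - pad, hi[2] + pad + 1):
                mag = flow_mag.get((x, y, z), 0.0)
                in_box = all(lo[k] <= c <= hi[k] for k, c in enumerate((x, y, z)))
                (inside if in_box else shell).append(mag)
    return sum(inside) / len(inside) - sum(shell) / len(shell)


flow_mag = {(0, 0, 0): 2.0, (1, 0, 0): 2.0, (2, 0, 0): 0.1, (5, 5, 5): 3.0}
proposals = sorted(segment_moving_voxels(flow_mag, threshold=1.0), key=len)
scores = [center_surround_score(flow_mag, component) for component in proposals]
```

Here the two adjacent voxels form one proposal and the isolated voxel forms another; the residual motion just outside the first proposal's box slightly lowers its score.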

Our work is close to a long line of works in unsupervised 2D motion segmentation springerlink:10.1007/978-3-642-15555-0_21 ; OB11 . We show that by applying similar methods in an egomotion-stabilized 3D imagination space, as opposed to 2D pixel space, simple background subtraction suffices for finding moving objects.

4 Experiments

We test our models in CARLA Dosovitskiy17 , an open-source photorealistic simulator of urban driving scenes, which permits moving the camera to any desired viewpoint in the scene. We collect video sequences from CARLA’s City1 as the training set, and videos from City2 as the test set. Further details regarding the CARLA dataset collection can be found in the supplementary file.

Rendered images from the CARLA simulator have complex textures and specularities and are close to photorealism, which causes RGB view prediction methods to fail. In Figure 2, we show images predicted by (i) the RGB regression model of Tung et al. commonsense , (ii) a VAE alternative of Tung et al. commonsense , and (iii) Generative Query Networks (GQN) of Eslami et al. Eslami1204 , which do not have a 3D representation bottleneck. We visualize our model’s predictions as colorized PCA embeddings, for interpretability. In the supplementary material we evaluate our method and the baselines in 2D and 3D visual correspondence and retrieval tasks, and show our model dramatically outperforms the baselines.

Figure 2: Sample renders from the contrastive and generative models. Given the input view, and the pose of the target view, the model is tasked with rendering the target view. For each input, we show our model’s bottom-up and context (top-down) embeddings of the target view, along with the RGB render of Tung et al. commonsense and a VAE variant of it, and the RGB render of GQN Eslami1204 .

Our work uses view prediction as a means of learning useful visual representations for 3D object detection, segmentation and motion estimation, not as the end task itself. Our experiments below evaluate the model’s performance on (1) motion estimation, (2) unsupervised moving object segmentation, and (3) semi-supervised 3D object detection.

4.1 3D motion estimation

We first evaluate the accuracy of our 3D FlowNet. Note that our 3D FlowNet operates after egomotion stabilization, so to isolate the error due to the flow estimation alone (and not due to egomotion error), we use ground-truth egomotion to stabilize the scene. Results are shown in Table 1. We compare our model against the RGB view prediction baseline of Tung et al. commonsense : for both models we use the same 3D flow estimation described in Section 3.2; what differs are the input 3D feature maps. We also use a zero-motion baseline that predicts zero motion everywhere. Since approximately 97% of the voxels belong to the static scene, a zero-motion baseline is very competitive in an overall average. We therefore score error separately in static and moving parts of the scene. Our method shows dramatically lower error than the RGB prediction baseline. In Table 2, we evaluate our egomotion module against ORB-SLAM2 mur2017orb , a state-of-the-art geometric SLAM method, and show comparable performance.

Method                    Full   Static   Moving
Zero-motion               0.19   0.0      6.97
Tung et al. commonsense   0.77   0.63     5.55
Ours                      0.26   0.17     3.33

Table 1: Mean endpoint error of the 3D flow in egomotion-stabilized scenes. Features learned from the proposed view-contrastive objective result in more accurate 3D motion estimates.

Method                 Rotation error (rad)   Translation error (m)
ORB-SLAM2 mur2017orb   0.089                  0.038
Ours                   0.120                  0.036

Table 2: Egomotion error. Our model performs on par with ORB-SLAM2.
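The split scoring used in Table 1 can be sketched as below. The example assumes that flow fields are given as lists of 3D vectors, one per voxel, and that a voxel counts as moving whenever its ground-truth flow is non-zero; the function name and the tolerance `eps` are hypothetical.

```python
import math


def mean_endpoint_error(pred, gt, eps=1e-6):
    """Mean endpoint error over all voxels, static voxels, and moving voxels."""
    buckets = {"full": [], "static": [], "moving": []}
    for p, g in zip(pred, gt):
        err = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, g)))
        buckets["full"].append(err)
        gt_mag = math.sqrt(sum(c ** 2 for c in g))
        buckets["moving" if gt_mag > eps else "static"].append(err)
    return {
        name: (sum(errs) / len(errs) if errs else float("nan"))
        for name, errs in buckets.items()
    }


# Three static voxels and one voxel moving with flow (3, 4, 0).
gt_flow = [(0.0, 0.0, 0.0)] * 3 + [(3.0, 4.0, 0.0)]
zero_pred = [(0.0, 0.0, 0.0)] * 4
zero_baseline = mean_endpoint_error(zero_pred, gt_flow)
```

In this toy case the zero-motion prediction has no error on static voxels and a large error on the moving voxel, while the overall average stays small because static voxels dominate, which is the effect that motivates reporting the two parts separately.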

4.2 Unsupervised 3D object motion segmentation

We test the proposed 3D object motion segmentation in a dataset of two-frame video sequences of dynamic scenes. We compare (i) our full model (trained with the view-contrastive objective), (ii) the same model but finetuned on the test data, (iii) the RGB-based model of Tung et al. commonsense , and (iv) a 2.5D baseline (2.5D PWC-Net). The 2.5D baseline computes 2D optical flow (using PWC-Net sun2018pwc ), then thresholds the magnitudes and obtains mask proposals in 2D; these proposals are mapped to 3D boxes using the input depth.

Our 3D motion estimation on the ego-stabilized space is affected by the quality of the egomotion estimate. We thus evaluate it in the following setups: (i) the camera is stationary (S), (ii) the camera moves between the two frames but we stabilize using groundtruth egomotion (GE), (iii) the camera moves between the two frames and we stabilize using our estimated egomotion (EE). We show precision-recall curves of 3D object detection in Figure 3.

Our contrastive model outperforms all baselines. Finetuning on the test scenes with view contrastive prediction slightly improves the features and permits better 3D dense motion estimation and segmentation. Using estimated egomotion instead of ground-truth only incurs a small cost in performance. The standard RGB regression loss of Tung et al. commonsense leads to relatively imprecise object detections. The 2.5D baseline performs reasonably in the static-camera setting, but fails when the camera moves.

We have also attempted to compare against the VAE methods proposed in Kosiorek:2018:SAI:3327757.3327951 ; NIPS2018_7333 . In those models, the inference network uses the full video frame sequences to predict the locations of object bounding boxes, as well as frame-to-frame displacements, in order to minimize view prediction error. We were not able to produce meaningful results with the inference network. We attribute this failure to the difficulty of localizing 3D objects starting from random initialization. The success of NIPS2018_7333 may partially depend on carefully selected priors for 2D bounding box location and size parameters that match the moving MNIST dataset statistics, as suggested by the publicly available code; we do not assume knowledge or existence of such priors for our CARLA data.

4.3 Semi-supervised learning of 3D object detection

In this section, we evaluate whether the proposed self-supervised 3D predictive learning improves the generalization of 3D object detectors when combined with a small labelled training set (note that in the previous section we obtained 3D segmentations without any 3D box annotations). We consider two evaluation setups: (i) Pre-training features with self-supervised view-contrastive prediction (in both training and test sets), then training a 3D object detector on top of these features in a supervised manner, while either freezing or finetuning the initial features. (ii) Co-training supervised 3D object detection with self-supervised view-contrastive prediction in both train and test sets. Both regimes, pre-training and co-training, have been previously used in the literature to evaluate the suitability of self-supervised (2D) deep features when re-purposed for a semantic 2D detection task DBLP:journals/corr/DoerschGE15 ; here, we re-purpose 3D feature representations for 3D object detection. Our 3D object detector is an adaptation of the state-of-the-art 2D object detector, Faster-RCNN DBLP:journals/corr/HeGDG17 , to have 3D input and output instead of 2D. The 3D input is simply our model’s hidden state.

Pre-training and co-training results are shown in Table 3. We show the mean average precision of the predicted 3D boxes in the test sets for three different IOU thresholds. In every setting, on every considered threshold, both RGB-based and contrastive predictive training improve mean average precision (mAP) over the supervised baseline; the proposed view-contrastive loss gives the largest performance boost, especially in the pre-training case.
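For reference, the IOU test behind these thresholds can be sketched for the simplest case. The example assumes axis-aligned boxes written as `(xmin, ymin, zmin, xmax, ymax, zmax)`; whether the evaluated boxes are axis-aligned or oriented is not stated in this text, so this is an illustration of the metric and not of the evaluation code.

```python
def iou_3d(box_a, box_b):
    """Intersection over union of two axis-aligned 3D boxes.

    Each box is (xmin, ymin, zmin, xmax, ymax, zmax).
    """
    intersection = 1.0
    for k in range(3):
        lo = max(box_a[k], box_b[k])
        hi = min(box_a[k + 3], box_b[k + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        intersection *= hi - lo

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return intersection / (volume(box_a) + volume(box_b) - intersection)


predicted = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
ground_truth = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)
overlap = iou_3d(predicted, ground_truth)
# The same prediction can count as correct at one threshold and not at another.
matches = {threshold: overlap >= threshold for threshold in (0.33, 0.50, 0.75)}
```

In this example the two boxes share half of their extent along one axis, so the prediction passes the 0.33 threshold but fails the stricter ones.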

We next evaluate how results vary with the number of labelled examples used for supervision. We compare models pre-trained with contrastive view prediction against models trained from scratch, while varying the amount of supervision. In the pre-training case, we freeze the feature layers after unsupervised view-contrastive learning, and only supervise the detector; in the supervised case, we train end-to-end. As expected, the fully-supervised models perform better as more labelled data is added. In low-data settings, pre-training greatly improves results (e.g., 0.53 vs. 0.39 mAP at 500 labels). When all labels are used, the semi-supervised and supervised models are equally strong. We additionally evaluate a set of models pre-trained (unsupervised) on all available data (train+test); these perform slightly better than the models that only use the training data. Overall, the results suggest that contrastive view prediction leads to features that are relevant for object detection, which is especially useful when few labels are available.

Figure 3: Unsupervised 3D moving object detection with a stationary camera (S), groundtruth egomotion (GE), and estimated egomotion (EE). Our model outperforms all baselines, across all recall values.

Figure 4: Semi-supervised 3D object detection accuracy, as a function of the number of labelled examples. Pre-training with view prediction improves results, especially when there are few labels.

Method                                                    mAP @0.33 IOU   mAP @0.50 IOU   mAP @0.75 IOU
Pre-training then freezing the feature encoder
    No pre-training (i.e., random feature encoder)        .77             .70             .34
    Pre-training with generative objective commonsense    .88             .81             .39
    Pre-training with contrastive objective               .90             .83             .53
Pre-training then finetuning end-to-end
    Pre-training with generative objective commonsense    .91             .84             .55
    Pre-training with contrastive objective               .91             .85             .59
Co-training with view prediction
    No co-training (i.e., supervised baseline)            .91             .83             .63
    Co-training with generative objective commonsense     .92             .88             .64
    Co-training with contrastive objective                .95             .90             .69

Table 3: Semi-supervised 3D object detection. The proposed view-contrastive self-supervised loss dramatically outperforms models trained purely from supervised labels, as well as semi-supervised ones with RGB view prediction.

The proposed model has two important limitations. First, the 3D latent space makes heavy use of GPU memory, which limits either the resolution or the metric span of the latent map. Second, our experiments are conducted in simulated environments. Our work argues in favor of embodiment, but this is hard to realize in practice, as collecting multiview data in the real world is challenging. Training and testing our model in the real world with low-cost robots DBLP:journals/corr/abs-1807-07049 is a clear avenue for future work.

5 Conclusion

We propose models that learn 3D latent imaginations of the world given 2.5D input by minimizing 3D and 2D view contrastive prediction objectives. They further detect moving objects by estimating and thresholding residual 3D motion in their latent (stabilized) imagination space, generalizing background subtraction to arbitrary camera motion and scene geometry. Our empirical findings suggest that embodiment (equipping artificial agents with cameras to view the world from multiple viewpoints) can substitute for human annotations in 3D object detection. “We move in order to see and we see in order to move”, said J.J. Gibson in 1979 Gibson1979-GIBTEA . Our work presents 3D view-contrastive predictive learning as a working scalable computational model of such motion supervision, and demonstrates its clear benefits over previous methods.


We thank Christopher G. Atkeson for providing the historical context of the debate on the utility and role of geometric 3D models versus feature-based models, and for many other helpful discussions.


Appendix A Architecture details for 3D bottlenecked RNNs

Our model builds upon the geometry-aware recurrent neural networks (GRNNs) of Tung et al. commonsense . These models are RNNs that have a 4D latent state , which has spatial resolution (width, height, and depth) and feature dimensionality (channels). At each time step, we estimate the rigid transformation between the current camera viewpoint and the coordinate system of the latent map , then rotate and translate the features extracted from the current input view and depth map to align them with the coordinate system of the latent map, and convolutionally update the latent map using a standard convolutional 3D GRU, 3D LSTM or plain feature averaging. We found plain averaging to work well, while being much faster than GRU or LSTM. We refer to the memory state as the model’s imagination to emphasize that most of the grid points in will not be observed by any sensor, and so the feature content must be “imagined” by the model. We use a resolution of for . The imagination space corresponds to a large box of Euclidean world space. We set this box to a size of meters, and place it in front of the camera’s position at the first timestep, oriented parallel to the ground plane. The input images are pixels. We trim input pointclouds to a maximum of 100,000 points, and to a range of 80 meters, to simulate a Velodyne sensor.

Below we present the individual modules in detail. The full set of modules allows the model to differentiably go back and forth between 2D pixel observation space and 3D imagination space.

2D-to-3D unprojection

This module converts the input 2D image and depth map into a 4D tensor , by filling the 3D imagination grid with samples from the 2D image grid, using perspective (un)projection. Specifically, for each cell in the imagination grid, indexed by the coordinate , we compute the floating-point 2D pixel location that it projects to from the current camera viewpoint, using the pinhole camera model hartley2003multiple , where is the similarity transform that converts memory coordinates to camera coordinates and is the camera intrinsics (transforming camera coordinates to pixel coordinates). We fill each cell with the bilinearly interpolated pixel value at that location. We transform our depth map in a similar way and obtain a binary occupancy grid , by assigning each voxel a value of 1 or 0, depending on whether or not a point lands in the voxel. We concatenate this to the unprojected RGB, making the tensor .
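The unprojection step can be illustrated with a minimal pure-Python sketch. It assumes voxel centers have already been mapped into camera coordinates (i.e., the memory-to-camera similarity transform has been applied), a single-channel image stored as a list of rows, and pixel centers at integer coordinates. The function names and the intrinsics parameters `fx`, `fy`, `cx`, `cy` are illustrative, not taken from the authors' code.

```python
import math


def project_point(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (x, y, z) to a pixel (u, v)."""
    x, y, z = point_cam
    return fx * x / z + cx, fy * y / z + cy


def bilinear_sample(image, u, v):
    """Bilinearly interpolate `image` (list of rows) at float location (u, v).

    Returns None when the location falls outside the image.
    """
    height, width = len(image), len(image[0])
    if not (0 <= u <= width - 1 and 0 <= v <= height - 1):
        return None
    u0, v0 = int(math.floor(u)), int(math.floor(v))
    u1, v1 = min(u0 + 1, width - 1), min(v0 + 1, height - 1)
    du, dv = u - u0, v - v0
    top = (1 - du) * image[v0][u0] + du * image[v0][u1]
    bottom = (1 - du) * image[v1][u0] + du * image[v1][u1]
    return (1 - dv) * top + dv * bottom


def unproject(image, voxel_centers_cam, fx, fy, cx, cy):
    """Fill each voxel with the image value at the pixel it projects to.

    Voxels behind the camera or outside the field of view receive 0.0.
    """
    values = []
    for point in voxel_centers_cam:
        if point[2] <= 0:  # behind the camera
            values.append(0.0)
            continue
        u, v = project_point(point, fx, fy, cx, cy)
        sample = bilinear_sample(image, u, v)
        values.append(0.0 if sample is None else sample)
    return values
```

Note that every voxel along a viewing ray receives the same pixel value under this scheme, which is why the occupancy grid is concatenated as an extra channel.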

In order to learn a geometrically consistent 3D feature map, our model needs to register all unprojected tensors over time by cancelling the relative 3D rotation and translation between the camera viewpoints. We denote registered tensors with a subscript reg. We treat the first camera position as the reference system thus (and ). On later timesteps (assuming the camera moves), we obtain using a modified sampling equation, given by , where is the camera pose at timestep (transforming reference-camera coordinates to current-camera coordinates). In this work, we assume access to ground-truth camera pose information at training time, since an active observer does have access to its approximate egomotion. At test time, we estimate the egomotion. Unlike the GRNN in Tung et al. commonsense , our GRNN’s camera is not restricted to 2D motion along a viewing sphere; instead, we consider a general camera with full 6-DoF motion.

Latent map recurrent update

This module aggregates egomotion-stabilized (registered) feature tensors into the memory tensor . We first pass the registered tensors through a series of 3D convolution layers, producing a 3D feature tensor for the timestep, denoted . On the first timestep, we set . On later timesteps, our memory update is computed using a running average operation. 3D convolutional LSTM or 3D convolutional GRU updates could be used instead.
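A running-average memory update of the kind described above can be sketched as follows. This is a minimal illustration on flattened feature lists, assuming an unweighted mean over all timesteps seen so far; the function names are hypothetical.

```python
def update_memory(memory, features, step):
    """Running average of registered feature tensors (flattened to lists).

    `step` is the zero-based timestep; on the first step the memory is
    simply initialized to the current features.
    """
    if step == 0 or memory is None:
        return list(features)
    return [(m * step + f) / (step + 1) for m, f in zip(memory, features)]


def aggregate(feature_sequence):
    """Fold a sequence of per-timestep features into a single memory."""
    memory = None
    for step, features in enumerate(feature_sequence):
        memory = update_memory(memory, features, step)
    return memory
```

Because the update has no learned parameters, it costs one multiply-add per voxel, which is consistent with the observation that plain averaging is much faster than a GRU or LSTM.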

The 3D feature encoder-decoder has the following architecture, using kernel-stride-channels notation: 4-2-64, 4-2-128, 4-2-256, 4-0.5-128, 4-0.5-64, 1-1- , where is the feature dimension. We use . After each deconvolution (stride 0.5 layer) in the decoder, we concatenate the same-resolution featuremap from the encoder. Every convolution layer (except the last in each net) is followed by a leaky ReLU activation and batch normalization.

Egomotion estimation

This module computes the relative 3D rotation and translation between the current camera viewpoint and the reference coordinate system of the map . We significantly changed the module of Tung et al. commonsense , which could only handle two degrees of freedom of camera motion. Our egomotion module is inspired by the state-of-the-art PWC-Net optical flow method sun2018pwc : it incorporates spatial pyramids, incremental warping, and cost volume estimation via cross-correlation.

While and can be used directly as input to the egomotion module, we find better performance can be obtained by allowing the egomotion module to learn its own featurespace. Thus, we begin by passing the (unregistered) 3D inputs through a series of 3D convolution layers, producing a reference tensor , and a query tensor . We wish to find the rigid transformation that aligns the two.

We use a coarse-to-fine architecture, which estimates a coarse 6D answer at the coarse scale, and refines this answer at a finer scale. We iterate across scales in the following manner: First, we downsample both feature tensors to the target scale (unless we are at the finest scale). Then, we generate several 3D rotations of the second tensor, representing “candidate rotations”, making a set , where is the discrete set of 3D rotations considered. We then use 3D axis-aligned cross-correlations between and the , which yields a cost volume of shape , where is the total number of spatial positions explored by cross-correlation. We average across spatial dimensions, yielding a tensor shaped , representing an average alignment score for each transform. We then apply a small fully-connected network to convert these scores into a 6D vector. We then warp according to the rigid transform specified by the 6D vector, to bring it into (closer) alignment with . We repeat this process at each scale, accumulating increasingly fine corrections to the initial 6D vector.

Similar to PWC-Net sun2018pwc , since we compute egomotion in a coarse-to-fine manner, we need only consider a small set of rotations and translations at each scale (when generating the cost volumes); the final transform composes all incremental transforms together. However, unlike PWC-Net, we do not repeatedly warp our input tensors, because this accumulates interpolation error. Instead, we follow the inverse compositional Lucas-Kanade algorithm baker2004lucas ; lin2017inverse , and at each scale warp the original input tensor with the composed transform.
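The idea of composing incremental corrections, and then warping the original tensor once with the composed transform, can be sketched with 4x4 homogeneous matrices. This is only an illustration: rotations are restricted to the z axis for brevity, the composition order (each new correction applied on the left) is an assumption, and the helper names are hypothetical.

```python
import math


def matmul4(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rigid_z(angle_deg, translation):
    """4x4 rigid transform: rotation about the z axis, then a translation."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    tx, ty, tz = translation
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def compose_corrections(corrections):
    """Compose per-scale corrections (coarsest first) into one transform.

    The composed matrix is what would be used to resample the ORIGINAL
    input tensor, so interpolation happens once per scale instead of
    being stacked on top of previously warped data.
    """
    total = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for correction in corrections:
        total = matmul4(correction, total)
    return total
```

For example, a coarse estimate of 10 degrees followed by a fine correction of 5 degrees composes to a single 15-degree rotation, which is then applied to the unwarped input.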

After solving for the relative pose, we generate the registered 3D tensors by re-unprojecting the raw inputs (using the solved pose during unprojection). Note that an alternative registration method is to bilinearly warp the unregistered tensors using the relative pose (as done in Tung et al. commonsense ), but we find that this introduces noticeable and unnecessary interpolation error.

3D-to-2D projection

This module “renders” 2D feature maps given a desired viewpoint by projecting the 3D feature state . We first appropriately orient the state map by resampling the 3D featuremap into a view-aligned version . The sampling is defined by , where is (as before) the similarity transform that brings imagination coordinates into camera coordinates, is the transformation that relates the reference camera coordinates to the viewing camera coordinates, are voxel indices in , and are voxel indices in . We then warp the view-oriented tensor such that perspective viewing rays become axis-aligned. We implement this by sampling from the memory tensor with the correspondence , where the indices span the image we wish to generate, and spans the length of each ray. We use logarithmic spacing for the increments of , finding it far more effective than linear spacing (used in prior work commonsense ), likely because our scenes cover a large metric space. We call the perspective-transformed tensor . To avoid repeated interpolation, we actually compose the view transform with the perspective transform, and compute from with a single trilinear sampling step. Finally, we pass the perspective-transformed tensor through a series of 2D convolutional layers, converting it to a 2D feature map. We denote the final output of this 3D-to-2D process as . Note that the previous work of Tung et al. commonsense decodes directly to an RGB image using an LSTM residual decoder. We use their model as our baseline in the experimental section.
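The difference between logarithmic and linear spacing of depth samples along a ray can be sketched as below. The near and far limits in the usage note are hypothetical values chosen for illustration; the text does not state the actual range.

```python
import math


def log_spaced_depths(near, far, num_samples):
    """Depths along a viewing ray, evenly spaced in log(depth)."""
    if num_samples < 2 or near <= 0 or far <= near:
        raise ValueError("need num_samples >= 2 and 0 < near < far")
    log_near, log_far = math.log(near), math.log(far)
    step = (log_far - log_near) / (num_samples - 1)
    return [math.exp(log_near + i * step) for i in range(num_samples)]


def linear_spaced_depths(near, far, num_samples):
    """Depths along a viewing ray, evenly spaced in metric depth."""
    if num_samples < 2 or far <= near:
        raise ValueError("need num_samples >= 2 and near < far")
    step = (far - near) / (num_samples - 1)
    return [near + i * step for i in range(num_samples)]
```

With `near=1.0`, `far=100.0` and three samples, logarithmic spacing places the middle sample at roughly 10 m while linear spacing places it at 50.5 m, so the logarithmic scheme spends more samples on nearby geometry, where each voxel covers more pixels.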

The view renderer has the following architecture: maxpool 8-8 along the depth axis, 3-1-32 (3D conv), reshape from 4D to 3D, 3-1-32 (2D conv), 1-1- (2D conv), where is the embedding/channel dimension. For predicting RGB, ; for metric learning, we use . We find that with dimensionality or less, the model underfits.

3D imagination FlowNet

To train our 3D FlowNet, we use supervised labels generated from synthetic transformations of the input, together with an unsupervised loss based on the standard variational loss horn1981determining ; back_to_basics:2016 . For the synthetic transformations, we randomly sample from three uniform distributions of rigid transformations: (i) large motion, with rotation angles in the range (degrees) and translations in (meters), (ii) small motion, with angles from and translations from , (iii) zero motion. We found that without sampling (additional) small and zero motions, the model does not accurately learn these ranges. Still, since these synthetic transformations cause the entire tensor to move at once, a FlowNet learned from this supervision alone tends to produce overly-smooth flow in scenes with real (non-rigid) motion. The variational loss, described next, overcomes this issue.

In the variational loss horn1981determining ; back_to_basics:2016 , we use a pair of consecutive frames with motion between them (including camera and object motion), estimate the flow, and back-warp the second tensor to align with the first. Compared to a standard optical flow method, we apply this loss in 3D with voxel features, rather than in 2D with image pixels:


where is the memory tensor, and is the inverse-warped tensor from the next timestep. We apply the warp with a differentiable 3D spatial transformer layer, which does trilinear interpolation to resample each voxel. Note that this loss makes an assumption of voxel feature constancy (instead of the traditional pixel brightness constancy)—and this assumption is made true across views by our metric learning loss.

We also apply a smoothness loss penalizing local flow changes: , where F is the estimated flow field and is the 3D spatial gradient. This is a standard technique to prevent the model from only learning motion edges horn1981determining ; back_to_basics:2016 .
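A smoothness penalty on the flow field can be sketched as follows. The exact norm is not recoverable from the text here, so this sketch assumes an L1 penalty on forward finite differences, averaged over all neighboring voxel pairs; the function name and the `flow[z][y][x]` layout are illustrative.

```python
def smoothness_loss(flow):
    """Mean absolute forward difference of a dense 3D flow field.

    `flow[z][y][x]` is a 3-vector (fx, fy, fz). Differences are taken
    along each of the three grid axes where a neighbor exists.
    """
    depth, height, width = len(flow), len(flow[0]), len(flow[0][0])
    total, count = 0.0, 0
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                here = flow[z][y][x]
                for dz, dy, dx in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
                    nz, ny, nx = z + dz, y + dy, x + dx
                    if nz < depth and ny < height and nx < width:
                        there = flow[nz][ny][nx]
                        total += sum(abs(a - b) for a, b in zip(there, here))
                        count += 1
    return total / count if count else 0.0
```

A spatially constant flow field (such as one produced by a pure rigid translation) incurs zero penalty, while an isolated moving voxel is penalized at each of its boundaries.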

Unlike 2D flow methods, our unsupervised 3D flow does not require special terms to deal with occlusions and disocclusions, since 3D flow is defined everywhere inside the grid.

Note that the flow of free-space voxels is arbitrary. For this reason, we element-wise multiply the flow grid by the occupancy grid before attempting object discovery. From another perspective, this multiplication ensures that we only discover solid objects.

3D occupancy estimation

The goal in this step is to estimate which voxels in the imagination grid are “occupied” (i.e., have something visible inside) and which are “free” (i.e., have nothing visible inside).

For supervision, we obtain (partial) labels for both “free” and “occupied” voxels using the input depth data. Sparse “occupied” voxel labels are given by the voxelized pointcloud . To obtain labels of “free” voxels, we trace the source-camera ray to each occupied observed voxel, and mark all voxels intersected by this ray as “free”.
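The free-space labelling can be sketched by marching along the ray from the camera to each occupied voxel. This is a simplified illustration that samples the ray at a fixed sub-voxel interval rather than performing exact voxel traversal; coordinates are in voxel units, and the function name and the `samples_per_voxel` parameter are hypothetical.

```python
import math


def free_voxels_along_ray(camera, occupied_voxel, samples_per_voxel=4):
    """Voxel indices crossed between the camera and an occupied voxel.

    `camera` is a float position in voxel units; `occupied_voxel` is an
    integer index triple. The occupied voxel itself is excluded.
    """
    target = tuple(i + 0.5 for i in occupied_voxel)  # voxel center
    length = math.dist(camera, target)
    num_steps = max(1, int(math.ceil(length * samples_per_voxel)))
    free = set()
    for step in range(num_steps):
        t = step / num_steps
        point = [c + t * (g - c) for c, g in zip(camera, target)]
        voxel = tuple(int(math.floor(p)) for p in point)
        if voxel != tuple(occupied_voxel):
            free.add(voxel)
    return free
```

Voxels beyond the occupied one are left unlabelled, which is why the resulting supervision is only partial.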

Our occupancy module takes the memory tensor as input, and produces a new tensor , with a value in [0, 1] at each voxel, representing the probability of the voxel being occupied. This is achieved by a single 3D convolution layer with a 1×1×1 filter (or, equivalently, a fully-connected network applied at each grid location), followed by a sigmoid nonlinearity. We train this network with the logistic loss,


where is the label map, and is an indicator tensor, indicating which labels are valid. Since there are far more “free” voxels than “occupied”, we balance this loss across classes within each minibatch.
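A masked, class-balanced logistic loss can be sketched as below. The balancing scheme is an assumption: the loss is averaged within each class over valid voxels, and the two class means are then averaged, so that the abundant “free” voxels do not dominate. The function name is illustrative.

```python
import math


def balanced_logistic_loss(probs, labels, valid, eps=1e-7):
    """Class-balanced binary cross-entropy over voxels with known labels.

    probs:  predicted occupancy probabilities
    labels: 1 for occupied, 0 for free
    valid:  1 where the label is known, 0 where it should be ignored
    """
    losses = {0: [], 1: []}
    for p, y, v in zip(probs, labels, valid):
        if not v:
            continue  # unlabelled voxel contributes nothing
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        losses[y].append(-math.log(p) if y == 1 else -math.log(1.0 - p))
    per_class = [sum(ls) / len(ls) for ls in losses.values() if ls]
    return sum(per_class) / len(per_class) if per_class else 0.0
```

With this scheme, one occupied voxel and a thousand free voxels contribute equally to the gradient at the class level.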

Bottom-up network

The “bottom-up” embedding net is a residual network DBLP:journals/corr/HeZRS15 with two residual blocks, with two convolutions and one transposed convolution in each block. The convolution layers’ channel dimensions are 64, 64, 64, 128, 128, 128. Finally there is one convolution layer with channels, where is the embedding dimension. We use .

Contrastive loss

For both the 2D and the 3D contrastive loss, for each example in the minibatch, we randomly sample a set of 960 pixel/voxel coordinates for supervision. Each coordinate gives a positive correspondence , since the tensors are aligned. For each , we sample a negative from the samples acquired across the entire batch, using the distance-weighted sampling strategy of Wu et al. wu2017sampling . In this way, on every iteration we obtain an equal number of positive and negative samples, where the negative samples are spread out in distance. We additionally apply an L1 loss on the difference between the entire tensors, which penalizes distance at all positive correspondences (instead of merely the ones sampled for the metric loss). We find this accelerates training. We use a coefficient of for , for , and for the L1 losses.

a.1 Code and training details

Our model is implemented in Python/TensorFlow, with custom CUDA kernels for the 3D cross correlation (used in the egomotion module and the flow module) and trilinear resampling (used in the 2D-to-3D and 3D-to-2D modules). The CUDA operations use less memory than native TensorFlow equivalents, which facilitates training with large imagination tensors.


We train our models using a batch size of either 2 or 4, depending on the number of active modules. This is approximately the memory limit of the 12 GB Titan X Pascal GPUs that we use for training. Training to convergence (about 200k iterations) takes approximately 48 hours. We use a learning rate of 0.001 for all modules except the 3D FlowNet, for which we use 0.0001. We use the Adam optimizer, with , .

Our code and data will be made publicly available upon publication.

Appendix B Additional experiments

b.1 Datasets

Figure 5: Dataset comparison for view prediction. From left to right: Shepard-Metzler Eslami1204 , Rooms-Ring-Camera Eslami1204 , ShapeNet arrangements commonsense , cars DBLP:journals/corr/TatarchenkoDB15 and CARLA (used in this work) Dosovitskiy17 . CARLA scenes are more realistic, and are not object-centric.
CARLA vs. other datasets

We test our method on scenes we collected from the CARLA simulator Dosovitskiy17 , an open-source driving simulator of urban scenes. CARLA permits moving the camera to any desired viewpoint in the scene, which is necessary for our view-based learning strategy. Previous view prediction works have considered highly synthetic datasets: The work of Eslami et al. Eslami1204 introduced the Shepard-Metzler dataset, which consists of seven colored cubes stuck together in random arrangements, and the Rooms-Ring-Camera dataset, which consists of rooms with random floor and wall colors and textures, each containing a variable number of shapes of different geometries and colors. The work of Tung et al. commonsense introduced a ShapeNet arrangements dataset, which consists of table arrangements of ShapeNet synthetic models shapenet2015 . The work of DBLP:journals/corr/TatarchenkoDB15 considers scenes with a single car. Such highly synthetic and limited-complexity datasets cast doubt on the usefulness and generality of view prediction for visual feature learning. The CARLA simulation environments considered in this work have photorealistic rendering, and depict diverse weather conditions, shadows, and objects, and arguably are much closer to real-world conditions, as shown in Figure 5. While there exist real-world datasets which are visually similar Geiger2013IJRR ; caesar2019nuscenes , they only contain a small number of viewpoints, which makes view-predictive training inapplicable.

CARLA data collection

We obtain data from the simulator as follows. We generate 1170 autopilot episodes of 50 frames each (at 30 FPS), spanning all weather conditions and all locations in both “cities” in the simulator. We define 36 viewpoints placed regularly along a 20m-radius hemisphere in front of the ego-car. This hemisphere is anchored to the ego-car (i.e., it moves with the car). In each episode, we sample 6 random viewpoints from the 36 and randomly perturb their pose, and then capture each timestep of the episode from these 6 viewpoints. We generate train/test examples from this, by assembling all combinations of viewpoints (e.g., viewpoints as input, and unseen viewpoint as the target). We filter out frames that have zero objects within the metric “in bounds” region of the GRNN (). This yields 172524 examples: 124256 in City1, and 48268 in City2. We treat the City1 data as the “training” set, and the City2 data as the “test” set, so there is no overlap between the train and test images.

Method                                        2D patch retrieval       3D patch retrieval
                                              P@1    P@5    P@10       P@1    P@5    P@10
2D-bottlenecked generative architectures
    GQN with 2D L1 loss Eslami1204            .00    .01    .02        -      -      -
3D-bottlenecked generative architectures
    GRNN with 2D L1 loss commonsense          .00    .00    .00        .42    .57    .60
    GRNN-VAE with 2D L1 loss & KL div.        .00    .00    .01        .34    .58    .66
3D-bottlenecked contrastive architectures
    GRNN with 2D contrastive loss             .14    .29    .39        .52    .73    .79
    GRNN with 2D & 3D contrastive losses      .20    .39    .47        .80    .97    .99

Table 4: Retrieval precision at different recall thresholds, in 2D and in 3D. P@K indicates precision at the K-th rank. Contrastive learning outperforms generative learning by a substantial margin.

b.2 2D and 3D correspondence

We evaluate our model’s performance in estimating visual correspondences in 2D and in 3D, using a nearest-neighbor retrieval task. In 2D, the task is as follows: we extract one “query” patch from a top-down render of a viewpoint, then extract 1000 candidate patches from bottom-up renders, with only one true correspondence (i.e., 999 negatives). We then rank all bottom-up patches according to L2 distance from the query, and report the retrieval precision at several recall thresholds, averaged over 1000 queries. In 3D the task is similar, but patches are feature cubes extracted from the 3D imagination; we generate queries from one viewpoint, and retrievals from other viewpoints and other scenes. The 1000 samples are generated as follows: from 100 random test examples, we generate 10 samples from each, so that each sample has 9 negatives from the same viewpoint, and 990 others from different locations/scenes.
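The retrieval metric can be sketched as follows. Since each query has exactly one true correspondence, this sketch takes P@K to be the fraction of queries whose true match appears among the K nearest candidates by L2 distance; this reading is an assumption (it is consistent with the scores being non-decreasing in K), and the function names are illustrative.

```python
import math


def rank_of_match(query, candidates, match_index):
    """1-based rank of the true correspondence when sorting by L2 distance."""
    distances = [math.dist(query, c) for c in candidates]
    target = distances[match_index]
    return 1 + sum(1 for d in distances if d < target)


def precision_at_k(queries, candidate_sets, match_indices, k):
    """Fraction of queries whose true match is within the top-k retrievals."""
    hits = sum(
        1
        for q, cands, m in zip(queries, candidate_sets, match_indices)
        if rank_of_match(q, cands, m) <= k
    )
    return hits / len(queries)
```

In the setting described above, each candidate set would hold 1000 embedding vectors and the scores would be averaged over 1000 queries.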

We compare the proposed model against (i) the RGB prediction baseline of Tung et al. commonsense , (ii) the Generative Query Network (GQN) of Eslami1204 , which does not have a 3D representation bottleneck, and (iii) a VAE variant of the (deterministic) model of Tung et al. commonsense . We provide experimental details in the supplementary file.

Figure 6: 2D image patch retrievals acquired with the contrastive model (left) vs the generative model (right). Each row corresponds to a query. For each model, the query is shown on the far left, and the 10 nearest neighbors are shown in ascending order of distance. Correct retrievals are highlighted with a green border. The neighbors of the contrastive model often have clear semantic relationships (e.g., curbs, windows). The neighbors of the RGB model appear to be only related in brightness.

Quantitative results are shown in Table 4. For 2D correspondence, the models learned through the RGB prediction objectives obtain precision near zero at each recall threshold, illustrating that the model is not learning precise RGB predictions. The proposed view contrastive losses perform better, and combining both the 2D and 3D contrastive losses is better than using only 2D. Interestingly, for 3D correspondence, the retrieval accuracy of the RGB-based models is relatively high. Training 3D bottlenecked RNNs as a variational autoencoder, where stochasticity is added in the 3D bottleneck, improves precision at the lower rank thresholds (P@5 and P@10), though not at P@1. Contrastive learning outperforms all baselines. Adding the 3D contrastive loss gives a large boost over using the 2D contrastive loss alone. Note that 2D-bottlenecked architectures Eslami1204 cannot perform 3D patch retrieval. Qualitative retrieval results for our full model vs. Tung et al. commonsense are shown in Figure 6.

b.3 3D motion estimation

The error metrics presented in the main paper are in voxel units. These can be converted to meters by multiplying by the voxel-to-meter ratio, which is 4. For this evaluation (and for training) we subsample every 6th frame to obtain large motion. Sample flows are visualized against ground truth in Figure 7.

Figure 7: 3D feature flow visualizations. Given the input frames on the left, our model estimates dense 3D flow fields; these are visualized on the right, along with the corresponding ground truth. We visualize the 3D flow fields in a “bird’s eye view” by taking an average along the vertical dimension of the tensor. To improve visual clarity, we use a weighted average, using the 3D occupancy tensor (of the same shape) to supply the weights. We visualize the unprojected RGB in the same way; note the long “shadows” cast by occlusion. Flow is visualized using the standard 2D flow color map (e.g., cyan indicates leftward motion, purple indicates downward motion).

b.4 Unsupervised 3D object motion segmentation

Our method proposes 3D object segmentations, but labels are only available in the form of oriented 3D boxes; we therefore convert our segmentations into boxes by fitting minimum-volume oriented 3D boxes to the segmentations. The precision-recall curves presented in the main paper are computed with an intersection-over-union (IOU) threshold of . Figure 8 shows sample visualizations of the 3D box proposals projected onto input images.

Figure 8: Left: 3D object proposals and their center-surround scores (normalized to the range [0,1]). For each proposal, the inset displays the corresponding connected component of the 3D flow field, viewed from a top-down perspective. In each row, an asterisk marks the box with the highest center-surround score. Right: Observed and estimated heightmaps of the given frames, computed from 3D occupancy grids. Note that the observed (ground truth) heightmaps have view-dependent “shadows” due to occlusions, while the estimated heightmaps are dense and viewpoint-invariant.

b.5 Semi-supervised object detection

The 3D Faster-RCNN, similar to its 2D counterpart, outputs axis-aligned boxes. We obtain axis-aligned 3D ground-truth from the CARLA simulator. This is unlike the self-supervised setting, where we produce oriented boxes (from motion segmentation) and evaluate against oriented ground truth.
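For axis-aligned boxes, the intersection-over-union used when thresholding detections can be computed in closed form. The sketch below assumes each box is given as a pair of (min corner, max corner) triples; the function names are illustrative.

```python
def box_volume(box):
    """Volume of an axis-aligned box ((x0, y0, z0), (x1, y1, z1))."""
    (x0, y0, z0), (x1, y1, z1) = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0) * max(0.0, z1 - z0)


def iou_3d(box_a, box_b):
    """Intersection-over-union of two axis-aligned 3D boxes."""
    lo = tuple(max(a, b) for a, b in zip(box_a[0], box_b[0]))
    hi = tuple(min(a, b) for a, b in zip(box_a[1], box_b[1]))
    intersection = box_volume((lo, hi))
    union = box_volume(box_a) + box_volume(box_b) - intersection
    return intersection / union if union > 0 else 0.0
```

Two unit cubes offset by half a side length along one axis have an IOU of 1/3, which illustrates how demanding the 0.75 threshold is in 3D.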

b.6 Occupancy prediction

We test our model’s ability to estimate occupied and free space. Given a single view as input, the model outputs an occupancy probability for each voxel in the scene. Then, given the aggregated labels computed from this view and a random next view, we compute accuracy at all voxels for which we have labels. Voxels that are not intersected by either view’s camera rays are left unlabelled. Table 5 shows the classification accuracy, evaluated independently for free and occupied voxels, and for all voxels aggregated together. Overall, accuracy is extremely high (97-98%) for both voxel types. Note that part of the occupied space (i.e., the voxelized pointcloud of the first frame) is an input to the network, so accuracy on this metric is expected to be high.

3D area type        Classification accuracy
Occupied space      .97
Free space          .98
All space           .98

Table 5: Classification accuracy of the occupancy prediction module. Accuracy is nearly perfect in each region type.

We show a visualization of the occupancy grids in Figure 8 (right). We visualize the occupancy grids by converting them to heightmaps. This is achieved by multiplying each voxel’s occupancy value by its height coordinate in the grid, and then taking a max along the grid’s height axis. The visualizations show that the occupancy module learns to fill the “holes” of the partial view, effectively imagining the complete 3D scene.
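The heightmap conversion described above can be sketched as follows, assuming the occupancy grid is indexed as `occupancy[h][r][c]` with `h` the zero-based height index, so that an occupied voxel in the lowest layer maps to height 0. The layout and the function name are illustrative.

```python
def occupancy_to_heightmap(occupancy):
    """Collapse an occupancy grid into a top-down heightmap.

    Each voxel's occupancy value is multiplied by its height index, and
    the maximum is taken along the height axis.
    """
    num_levels = len(occupancy)
    rows, cols = len(occupancy[0]), len(occupancy[0][0])
    return [
        [
            max(occupancy[h][r][c] * h for h in range(num_levels))
            for c in range(cols)
        ]
        for r in range(rows)
    ]
```

When the grid holds probabilities rather than hard labels, the same operation gives a soft heightmap in which uncertain voxels contribute proportionally less.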