1 Introduction
Human infants quickly develop the ability to fixate on moving objects spelke1982perceptual and to perceive them as solid 3D shapes soska2008development. Though parents often supply object category labels, e.g., "it is not a dog, it is a cat", no 3D bounding boxes or 3D segmentation masks are supplied to aid the development of human object detectors, neither across our lifespans nor across our evolutionary history. Yet we all perceive the chair in front of our desk as occupying 3D space, as opposed to being a 2D surface of floating pixels. This is remarkable considering that our visual sensors are at best 2.5D at each time step (with the exception of stereoblindness doi1011772041669517738542).
Inspired by the human ability for 3D perception and imagination, a longstanding debate in Computer Vision is whether it is worth pursuing 3D models in the form of binary voxel grids, meshes, or 3D pointclouds as the output of visual recognition. The "blocks world" of Larry Roberts in 1965 blocksworld set as its goal to reconstruct the 3D scene depicted in an image in terms of 3D solids found in a database. Pointing out that replicating the 3D world in one's head is not enough to actually make decisions, Rodney Brooks argued for feature-based representations brooks1991intelligence, as done by recent work in end-to-end deep learning DBLP:journals/corr/LevineFDA15. Whether explicit 3D representations are available in the human cortex has not so far found a definite answer 10.1093/cercor/bhv229. To 3D, or not to 3D, that is the question atkeson.

Our work answers this question in the affirmative, and proposes learning-based 3D feature representations in place of previous human-engineered ones, such as 3D pointclouds, meshes, or voxel occupancies. We propose recurrent neural networks equipped with a 3D representation bottleneck as their latent state, as shown in Figure 1. We train these networks to predict 2D and 3D abstractions of the world scene from various camera viewpoints, emulating the supervision available to an embodied mobile agent. We show that their latent 3D space learns to form plausible 3D imaginations of the scene given 2D or 2.5D video streams as input: it inpaints visual information behind occlusions, infers the 3D extent of objects, and correctly predicts object occlusions and disocclusions. At their core, the proposed models estimate egomotion at each step and stabilize the extracted deep features before integrating them into a geometrically consistent 3D feature map. This is similar to what Simultaneous Localization and Mapping (SLAM) methods mur2017orb do when accumulating depth maps into a 3D scene pointcloud. Our models, which we call 3D-bottlenecked RNNs, use trainable SLAM-like modules with 2D and 3D deep features in place of depth maps or 3D pointclouds.
We handle multimodality and stochasticity during prediction with view-contrastive losses in place of RGB regression: at each timestep, we project the accumulated-thus-far 3D feature map to a desired viewpoint, and optimize for matching the projected features against 2D or 3D features extracted bottom-up from the view under consideration. We show the proposed contrastive loss outperforms RGB view regression in semi-supervised learning of 3D object detectors, as well as in estimating 2D and 3D visual correspondences. Our models are trained end-to-end by backpropagating the prediction errors.
We show that estimating dense 3D motion from frame to frame in the latent 3D imagination space allows label-free discovery of moving objects: any nonzero motion found while aligning consecutive 3D imaginations indicates an independently moving entity, since camera motion has been cancelled. We show the proposed 3D object discovery outperforms alternatives from recent works Kosiorek:2018:SAI:3327757.3327951 ; NIPS2018_7333 that operate in 2D on synthetic moving MNIST or background-subtracted static-camera 2D videos, as well as 2.5D motion segmentation baselines.
Our models build upon the work of Tung et al. commonsense, which also considered RNNs with a 3D representation bottleneck and egomotion-stabilized updates, similar to ours but trained for view regression. Their model was tested in very simplistic simulation environments, using two-degree-of-freedom camera motion and only static scenes (i.e., without moving objects). We compare against their model in our experimental section and show that features learnt under the proposed view-contrastive losses are more semantically meaningful.
Summarizing, we make the following contributions over prior work: (1) We propose novel view-contrastive losses and show that they scale to photorealistic scenes and learn semantic features for semi-supervised learning of 3D object detectors and 2D/3D visual correspondence, outperforming pixel regression Dosovitskiy17 ; commonsense and VAE Eslami1204 alternatives. (2) We propose a novel 3D moving-object segmentation method that estimates and thresholds 3D motion in a latent egomotion-stabilized 3D feature space; we show it outperforms 2.5D baselines and the iterative generative what-where VAEs Kosiorek:2018:SAI:3327757.3327951 ; NIPS2018_7333 of previous works.
Our model attempts to provide a scalable computational model of predictive coding rao1999predictive ; friston2003learning and mapping cogmap ; egobrain theories from cognitive neuroscience, by predicting future observations in 2D and 3D abstract feature space, while simultaneously building an internal geometrically consistent 3D map of the world scene. Our code and data will be made publicly available upon publication.
2 Related Work
Predictive visual feature learning. Predictive coding theories rao1999predictive ; friston2003learning suggest that the brain predicts observations at various levels of abstraction. These theories currently have extensive empirical support: stimuli are processed more quickly if they are predictable mcclelland1981interactive ; pinto2015expectations, prediction error is reflected in increased neural activity rao1999predictive ; brodski2015faces, and disproven expectations lead to learning schultz1997neural. Recent work in unsupervised learning has successfully used these ideas to learn word representations by predicting neighboring words mikolov2013efficient. Many challenges emerge in going from a finite word vocabulary to the continuous, high-dimensional image data manifold. Unimodal losses such as mean squared error are not very useful when predicting high-dimensional data, due to the stochasticity of the output space. Researchers have tried to handle such stochasticity using latent variable models loehlin1987latent or autoregressive prediction of the output pixel space, which involves sampling each pixel value from a categorical distribution van2016conditional conditioned on the output thus far. Another option is to make predictions in a latent feature space. Recently, the work of Oord et al. oord2018representation followed this direction and used an objective that preserves mutual information between future bottom-up extracted features and predicted contextual latent features, applying it to speech, text, and image patches in static images. The view-contrastive loss proposed in this work is a non-probabilistic version of their contrastive objective. However, our work focuses on the video domain as opposed to static image patches, and uses drastically different architectures for both the contextual and bottom-up representations, going through a 3-dimensional representation bottleneck, in contrast to oord2018representation. Alongside future image prediction rao1999predictive ; Eslami1204 ; DBLP:journals/corr/TatarchenkoDB15 ; commonsense, predicting some form of contextual or missing information has also been explored, such as predicting frame ordering lee2017unsupervised, spatial context DBLP:journals/corr/DoerschGE15 ; pathakCVPR16context, color from grayscale vondrick2018tracking, egomotion DBLP:journals/corr/JayaramanG15 ; DBLP:journals/corr/AgrawalCM15, and future motion trajectories DBLP:journals/corr/WalkerDGH16.

Deep geometry. Some recent works have attempted various forms of map-building gup ; henriques2018mapnet as a form of geometrically consistent temporal integration of visual information, in place of geometry-unaware vanilla LSTM hochreiter1997long or GRU cho2014learning models. The closest to ours are Learnt Stereo Machines (LSMs) LSM, DeepVoxels sitzmann2018deepvoxels, and Geometry-aware RNNs (GRNNs) commonsense, which integrate images sampled from a viewing sphere into a latent 3D feature memory tensor, in an egomotion-stabilized manner, and predict views. All of these works consider very simplistic, non-photorealistic environments, and none evaluates the suitability of the learnt features for a downstream task; rather, accurately predicting views is their main objective.
Motion estimation and object segmentation. Most motion segmentation methods operate in 2D image space and use 2D optical flow to segment moving objects. While earlier approaches attempted motion segmentation in a fully unsupervised manner by exploiting motion trajectories and integrating motion information over time springerlink:10.1007/9783642155550_21 ; OB11, recent works focus on learning to segment objects in videos, supervised by annotated video benchmarks Fragkiadaki_2015_CVPR ; DBLP:journals/corr/KhorevaBIBS17. Our work differs in that we address object detection and segmentation in 3D rather than in 2D, by estimating the 3D motion of the "imagined" (complete) scene, as opposed to the 2D motion of the observed scene.
3 Embodied view-contrastive 3D feature learning
We consider a mobile agent that can move its camera at will. In static scenes, our agent learns 3D visual feature representations by predicting 2D and 3D features of the view seen from a query camera viewpoint, and backpropagating the prediction errors (Section 3.1). In dynamic scenes, our agent learns to estimate dense 3D motion fields that bring its egomotion-stabilized 3D feature imaginations of consecutive timesteps into correspondence, and proposes 3D object segmentations by thresholding motion magnitude (Section 3.2). In both cases, learning relies solely on "labels" provided by the moving agent itself, which moves and watches objects move, without any human supervision.
3.1 Learning 3D feature representations by moving
Our model's architecture is illustrated in Figure 1 (top). It is an RNN with a 4D latent state $\mathbf{M}$, which has spatial resolution $w \times h \times d$ (width, height, and depth) and feature dimensionality $c$ (channels). The latent state aims to capture a geometrically consistent 3D deep feature map of the 3D world scene. We refer to the memory state as the model's imagination, to emphasize that most of the grid points in $\mathbf{M}$ will not be observed by any sensor, and so their feature content must be "imagined" by the model. Our model has a set of differentiable modules to go back and forth between the 3D feature imagination space and 2D image space. It builds upon the recently proposed geometry-aware recurrent neural networks (GRNNs) of Tung et al. commonsense, which also have a 4D egomotion-stabilized latent space and are trained for RGB view prediction. We use the terms GRNNs and 3D-bottlenecked RNNs interchangeably.
We briefly describe each module for completeness, and refer the reader to our supplementary file for more details. In comparison to Tung et al. commonsense, (i) our egomotion module can handle general (as opposed to two-degrees-of-freedom) camera motion, and we show it performs on par with geometric methods in Section 4, and (ii) our 3D-to-2D projection module decodes the 3D map into 2D feature maps seen from the viewpoint under consideration, as opposed to an RGB image.
2D-to-3D unprojection. This module converts the input RGB image $I_t$ and depth map $D_t$ into a 4D tensor $\mathbf{U}_t$, by filling the 3D imagination grid with samples from the 2D image pixel grid using perspective (un)projection, and maps the depth map to a binary occupancy voxel grid $\mathbf{O}_t$, assigning each voxel a value of 1 or 0 depending on whether or not a point lands in it.
Latent map update. This module aggregates egomotion-stabilized (registered) feature tensors into the memory tensor $\mathbf{M}_t$. We denote registered tensors with the subscript reg. We first pass the registered tensors through a series of 3D convolution layers, producing a 3D feature tensor for the timestep, denoted $\mathbf{S}_t$. On the first timestep, we set $\mathbf{M}_0 = \mathbf{S}_0$. On later timesteps, the memory is updated with a running average, as in the sketch below.
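For concreteness, here is a minimal sketch of the update in Python/NumPy (equal per-timestep weighting is our assumption; the paper specifies only a "running average"):

```python
import numpy as np

def update_memory(M_prev, S_t, t):
    """Running-average update of the latent map M (a sketch).

    M_prev: accumulated memory from timesteps 0..t-1, shape (w, h, d, c).
    S_t:    registered feature tensor for timestep t, same shape.
    t:      timestep index; t == 0 simply initializes the memory.
    """
    if t == 0:
        return S_t
    # Equal weight to every timestep seen so far:
    # M_t = (t * M_{t-1} + S_t) / (t + 1)
    return (t * M_prev + S_t) / (t + 1.0)
```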
Egomotion estimation. This module computes the relative 3D rotation and translation between the current camera pose (at timestep $t$) and the reference pose (at timestep $0$) of the latent 3D map. Our egomotion module handles general motion and uses spatial pyramids, incremental warping, and cost volume estimation via 6D cross-correlation, inspired by PWC-Net sun2018pwc, a state-of-the-art 2D optical flow method. For details, please refer to the supplementary file.
3D-to-2D projection. This module "renders" 2D feature maps from a desired viewpoint $v$ by projecting the 3D feature state $\mathbf{M}_t$. We first orient the state map by resampling the 3D feature map into a view-aligned version $\mathbf{M}_t^v$, and then map it to a set of 2D feature maps with 2D convolutions. We denote the final output of this 3D-to-2D process as $c^{2D}$.
3.1.1 View-contrastive 3D predictive learning
Given a set of (random) input views, we train our model to predict feature abstractions of the view seen from a (random) desired viewpoint $v^q$, as shown in Figure 1. We learn these abstractions end-to-end using view-contrastive metric learning. Specifically, we consider two representations for the target view $I^q$: a context-based one, $c$, and a bottom-up one, $b$. Note that the context-based representation has access to the viewpoint (pose) $v^q$ but not the view $I^q$, while the bottom-up representation is a function of the view $I^q$. The metric learning loss ties the context-based and bottom-up representations together.
We explore two variants of our contrastive predictive learning, one 2-dimensional and one 3-dimensional, as shown in Figure 1. These differ in the space where we compute the inner product between context-based and bottom-up representations. We obtain the contextual tensor $c^{3D}$ by orienting the 3D feature map built thus far to the query viewpoint $v^q$. We obtain the contextual tensor $c^{2D}$ by projecting $c^{3D}$ with the 3D-to-2D module. We obtain the bottom-up tensor $b^{3D}$ by feeding the target RGB-D view to the 2D-to-3D unprojection module. We obtain the bottom-up tensor $b^{2D}$ by convolving $I^q$ with a residual convolutional network. The corresponding contrastive losses read:
$$\mathcal{L}_{2D} = \sum_{i,j} \max\left(0,\; \alpha + y_{ij}\left(\left\| c^{2D}_i - b^{2D}_j \right\|_2 - \beta\right)\right), \qquad (1)$$

$$\mathcal{L}_{3D} = \sum_{i,j} \max\left(0,\; \alpha + y_{ij}\left(\left\| c^{3D}_i - b^{3D}_j \right\|_2 - \beta\right)\right), \qquad (2)$$
where $\alpha$ is a margin parameter, $\beta$ is a learned variable controlling the position of the boundary, and $y_{ij} \in \{+1, -1\}$ indicates whether $c^{2D}_i$ corresponds to $b^{2D}_j$ (and similarly for the 3D equation; $i, j$ range over sampled pixel/voxel locations). The contrastive losses pull corresponding (contextual and bottom-up) features close together in embedding space, and push non-corresponding ones beyond a margin of distance. It has been shown that the performance of a metric learning loss depends heavily on the sampling strategy used schroff2015facenet ; DBLP:journals/corr/SongXJS15 ; sohn2016improved. We use the distance-weighted sampling strategy of Wu et al. wu2017sampling, which uniformly samples "easy" and "hard" negatives wu2017sampling ; lehnensphere; we find this outperforms both random sampling and semi-hard schroff2015facenet negative sampling. A sketch of the loss appears below.
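A minimal sketch of Eqs. (1)-(2) in PyTorch (the paper's implementation is in TensorFlow; the tensor shapes and the treatment of $\beta$ as a plain scalar are our assumptions for illustration):

```python
import torch

def margin_contrastive_loss(c, b, y, alpha=0.1, beta=1.0):
    """Margin-based contrastive loss of Eqs. (1)-(2) (a sketch).

    c, b:  (N, e) context-based and bottom-up embeddings at N sampled
           pixel (2D case) or voxel (3D case) locations.
    y:     (N,) labels in {+1, -1}; +1 where c[i] corresponds to b[i].
    alpha: margin; beta: boundary (a learned scalar in the paper).
    """
    d = torch.norm(c - b, dim=1)              # embedding distances
    # Positives (y = +1) are pulled inside beta - alpha;
    # negatives (y = -1) are pushed beyond beta + alpha.
    return torch.clamp(alpha + y * (d - beta), min=0.0).mean()
```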
Intuitively, the proposed losses are a discriminative alternative to RGB-pixel or deep-feature L2 error. Our 3D metric learning loss asks 3D feature tensors depicting the same scene, but acquired from different viewpoints, to contain the same feature content. This 3D consistency is not guaranteed when training under a standard pixel-regression loss, and it is important for accurate dense 3D motion estimation, described next.
3.2 Learning to segment 3D objects by watching them move
Though the model is trained using sequences of frames as input, upon training it learns to map a single RGB-D image to a complete 3D imagination, as we show in Figure 1 (bottom). This is in contrast to SLAM methods, which do not learn or improve with experience, and always need to observe a scene from multiple viewpoints in order to reconstruct it. We freeze the imagination model's weights, and use its features to estimate dense 3D motion in the latent 4D feature space.
Specifically, given two temporally consecutive (registered) 3D maps $\mathbf{M}_t$ and $\mathbf{M}_{t+1}$, predicted independently from the images $I_t, I_{t+1}$ and depth maps $D_t, D_{t+1}$, we train a 3D FlowNet to predict the dense 3D motion field between them. Since camera motion has been cancelled, this 3D motion field is nonzero only for independently moving objects. Our 3D FlowNet iterates across scales in a coarse-to-fine manner. At each scale, we compute a 3D cost volume, convert these costs to 3D displacement vectors, and incrementally warp the two tensors to align them, essentially retargeting a state-of-the-art optical flow method sun2018pwc to operate in a 3D feature grid instead of a 2D pixel grid. We train our 3D FlowNet with two forms of supervision: synthetic augmentations, where we rigidly transform the first map and ask the FlowNet to recover the dense 3D flow field corresponding to this transformation, and a warping error, where we backwarp $\mathbf{M}_{t+1}$ to align it with $\mathbf{M}_t$ and backpropagate the L1 norm of the difference. This extends self-supervised 2D flow methods back_to_basics:2016 to 3D feature constancy (instead of 2D brightness constancy). We found that both the synthetic and the warp-based supervision are essential for obtaining accurate 3D flow estimates.

The computed 3D flow enjoys the benefits of permanent 3D feature content over time, and the lack of projective distortions. While 2D optical flow suffers from occlusions and disocclusions of image content Sun:CVPR:10 (for which 2D flow values are undefined), the proposed 3D imagination flow is always well-defined. Moreover, object motion in 3D does not suffer from the projection artifacts that turn rigid 3D transformations into non-rigid 2D flow fields. While 3D scene flow Hornacek_2014_CVPR concerns visible 3D points, 3D imagination flow is computed between visual features that may never have appeared in the field of view, but are instead inpainted by imagination. Since we are not interested in the 3D motion of empty air voxels, we additionally learn to estimate 3D voxel occupancy, supervised by the input depth maps, and set the 3D motion of all unoccupied voxels to zero. We describe our 3D occupancy learning in the supplementary file.
We obtain 3D object segmentation proposals by thresholding the 3D imagination flow magnitude and clustering voxels using connected components. We score each component using a 3D version of the center-surround motion saliency score employed by numerous works for 2D motion saliency detection 19146246 ; Mahadevan10spatiotemporalsaliency. This score is high when the 3D box interior contains a lot of motion but its surrounding shell does not. The result is a set of scored 3D segmentation proposals for each video scene, as sketched below.
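The pipeline can be summarized with the following NumPy/SciPy sketch (the flow-magnitude threshold and the shell width used for the center-surround score are illustrative values, not the paper's):

```python
import numpy as np
from scipy import ndimage

def segment_moving_objects(flow, occupancy, thresh=0.5):
    """Propose and score 3D object segments from imagination flow (a sketch).

    flow:      (w, h, d, 3) dense 3D flow in the stabilized latent grid.
    occupancy: (w, h, d) estimated voxel occupancy in [0, 1].
    """
    mag = np.linalg.norm(flow, axis=-1) * (occupancy > 0.5)  # zero out empty air
    moving = mag > thresh
    labels, n = ndimage.label(moving)            # 3D connected components
    proposals = []
    for obj in ndimage.find_objects(labels):
        inside = mag[obj].mean()
        # Center-surround score: motion inside the box vs. a surrounding shell.
        grow = tuple(slice(max(s.start - 2, 0), s.stop + 2) for s in obj)
        shell = mag[grow].sum() - mag[obj].sum()
        shell_vox = mag[grow].size - mag[obj].size
        surround = shell / max(shell_vox, 1)
        proposals.append((obj, inside - surround))  # higher = more salient
    return sorted(proposals, key=lambda p: -p[1])
```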
Our work is close to a long line of works on unsupervised 2D motion segmentation springerlink:10.1007/9783642155550_21 ; OB11. We show that by applying similar methods in an egomotion-stabilized 3D imagination space, as opposed to 2D pixel space, simple background subtraction suffices for finding moving objects.
4 Experiments
We test our models in CARLA Dosovitskiy17, an open-source photorealistic simulator of urban driving scenes, which permits moving the camera to any desired viewpoint in the scene. We collect video sequences from CARLA's City1 as the training set, and videos from City2 as the test set. Further details regarding CARLA dataset collection can be found in the supplementary file.
Rendered images from the CARLA simulator have complex textures and specularities and are close to photorealistic, which causes RGB view prediction methods to fail. In Figure 2, we show images predicted by (i) the RGB regression model of Tung et al. commonsense, (ii) a VAE alternative of Tung et al. commonsense, and (iii) the Generative Query Networks (GQN) of Eslami et al. Eslami1204, which lack a 3D representation bottleneck. We visualize our model's predictions as colorized PCA embeddings, for interpretability. In the supplementary material we evaluate our method and the baselines on 2D and 3D visual correspondence and retrieval tasks, and show our model dramatically outperforms the baselines.
Our work uses view prediction as a means of learning visual representations useful for 3D object detection, segmentation, and motion estimation, not as the end task itself. Our experiments below evaluate the model's performance on (1) motion estimation, (2) unsupervised moving-object segmentation, and (3) semi-supervised 3D object detection.
4.1 3D motion estimation
We first evaluate the accuracy of our 3D FlowNet. Note that our 3D FlowNet operates after egomotion stabilization, so to isolate the error due to flow estimation alone (and not due to egomotion error), we use ground-truth egomotion to stabilize the scene. Results are shown in Table 1. We compare our model against the RGB view prediction baseline of Tung et al. commonsense: for both models we use the same 3D flow estimation described in Section 3.2; what differs are the input 3D feature maps $\mathbf{M}$. We also use a zero-motion baseline that predicts zero motion everywhere. Since approximately 97% of the voxels belong to the static scene, a zero-motion baseline is very competitive in an overall average; we therefore report error separately for the static and moving parts of the scene. Our method shows dramatically lower error than the RGB prediction baseline. In Table 2, we evaluate our egomotion module against ORB-SLAM2 mur2017orb, a state-of-the-art geometric SLAM method, and show comparable performance.
4.2 Unsupervised 3D object motion segmentation
We test the proposed 3D object motion segmentation on a dataset of two-frame video sequences of dynamic scenes. We compare (i) our full model (trained with the view-contrastive losses), (ii) the same model finetuned on the test data, (iii) the RGB-based model of Tung et al. commonsense, and (iv) a 2.5D baseline (2.5D PWC-Net). The 2.5D baseline computes 2D optical flow (using PWC-Net sun2018pwc), thresholds the flow magnitudes, and obtains mask proposals in 2D; these proposals are mapped to 3D boxes using the input depth.
The 3D motion estimated in the ego-stabilized space is affected by the quality of the egomotion estimate. We thus evaluate it in the following setups: (i) the camera is stationary (S), (ii) the camera is moving between frames but we stabilize using ground-truth egomotion (GE), and (iii) the camera is moving and we stabilize using our estimated egomotion (EE). We show precision-recall curves of 3D object detection in Figure 3.
Our contrastive model outperforms all baselines. Finetuning on the test scenes with view-contrastive prediction slightly improves the features, permitting better dense 3D motion estimation and segmentation. Using estimated egomotion instead of ground truth incurs only a small cost in performance. The standard RGB regression loss of Tung et al. commonsense leads to relatively imprecise object detections. The 2.5D baseline performs reasonably in the static-camera setting, but fails when the camera moves.
We also attempted to compare against the VAE methods proposed in Kosiorek:2018:SAI:3327757.3327951 ; NIPS2018_7333. In those models, an inference network uses the full video frame sequence to predict the locations of object bounding boxes, as well as frame-to-frame displacements, in order to minimize view prediction error. We were not able to produce meaningful results with the inference network. We attribute this failure to the difficulty of localizing 3D objects starting from a random initialization. The success of NIPS2018_7333 may partially depend on carefully selected priors for the 2D bounding box location and size parameters, matched to the moving-MNIST dataset statistics, as suggested by the publicly available code; we do not assume knowledge or existence of such priors for our CARLA data.
4.3 Semi-supervised learning of 3D object detection
In this section, we evaluate whether the proposed self-supervised 3D predictive learning improves the generalization of 3D object detectors when combined with a small labelled training set (note that in the previous section we obtained 3D segmentations without any 3D box annotations). We consider two evaluation setups: (i) pretraining features with self-supervised view-contrastive prediction (on both the training and test sets), then training a 3D object detector on top of these features in a supervised manner, while either freezing or finetuning the initial features; and (ii) co-training supervised 3D object detection with self-supervised view-contrastive prediction on both the train and test sets. Both regimes, pretraining and co-training, have been previously used in the literature to evaluate the suitability of self-supervised (2D) deep features when repurposed for a semantic 2D detection task DBLP:journals/corr/DoerschGE15; here, we repurpose 3D feature representations for 3D object detection. Our 3D object detector is an adaptation of the state-of-the-art 2D object detector Faster R-CNN DBLP:journals/corr/HeGDG17, with 3D input and output instead of 2D. The 3D input is simply our model's hidden state $\mathbf{M}$.
Pretraining and co-training results are shown in Table 3. We report the mean average precision (mAP) of the predicted 3D boxes on the test set at three different IOU thresholds. In every setting, at every threshold, both RGB-based and contrastive predictive training improve mAP over the supervised baseline; the proposed view-contrastive loss gives the largest performance boost, especially in the pretraining case.
We next evaluate how results vary with the number of labelled examples used for supervision. We compare models pretrained with contrastive view prediction against models trained from scratch, varying the amount of supervision from a few hundred labelled examples to the full labelled set. In the pretraining case, we freeze the feature layers after unsupervised view-contrastive learning and only supervise the detector; in the supervised case, we train end-to-end. As expected, the fully supervised models perform better as more labelled data is added. In low-data settings, pretraining greatly improves results (e.g., 0.53 vs. 0.39 mAP at 500 labels). When all labels are used, the semi-supervised and supervised models are equally strong. We additionally evaluate a set of models pretrained (unsupervised) on all available data (train+test); these perform slightly better than the models that only use the training data. Overall, the results suggest that contrastive view prediction leads to features that are relevant for object detection, which is especially useful when few labels are available.
Table 3: Semi-supervised 3D object detection: mean average precision (mAP) of predicted 3D boxes on the test set, at three IOU thresholds.

Method                                                 mAP @0.33 IOU   mAP @0.50 IOU   mAP @0.75 IOU

Pretraining, then freezing the feature encoder
  No pretraining (i.e., random feature encoder)             .77             .70             .34
  Pretraining with generative objective commonsense         .88             .81             .39
  Pretraining with contrastive objective                    .90             .83             .53

Pretraining, then finetuning end-to-end
  Pretraining with generative objective commonsense         .91             .84             .55
  Pretraining with contrastive objective                    .91             .85             .59

Co-training with view prediction
  No co-training (i.e., supervised baseline)                .91             .83             .63
  Co-training with generative objective commonsense         .92             .88             .64
  Co-training with contrastive objective                    .95             .90             .69
Limitations
The proposed model has two important limitations. First, the 3D latent space makes heavy use of GPU memory, which limits either the resolution or the metric span of the latent map. Second, our results are confined to simulated environments. Our work argues in favor of embodiment, but this is hard to realize in practice, as collecting multi-view data in the real world is challenging. Training and testing our model in the real world with low-cost robots DBLP:journals/corr/abs180707049 is a clear avenue for future work.
5 Conclusion
We propose models that learn 3D latent imaginations of the world given 2.5D input, by minimizing 3D and 2D view-contrastive prediction objectives. They further detect moving objects by estimating and thresholding residual 3D motion in their latent (stabilized) imagination space, generalizing background subtraction to arbitrary camera motion and scene geometry. Our empirical findings suggest that embodiment—equipping artificial agents with cameras to view the world from multiple viewpoints—can substitute for human annotations in 3D object detection. "We move in order to see and we see in order to move," said J.J. Gibson in 1979 Gibson1979GIBTEA. Our work presents 3D view-contrastive predictive learning as a working, scalable computational model of such motion supervision, and demonstrates its clear benefits over previous methods.
Acknowledgements
We thank Christopher G. Atkeson for providing the historical context of the debate on the utility and role of geometric 3D models versus featurebased models, and for many other helpful discussions.
References
 (1) Russell A Epstein, E Z Patai, Joshua Julian, and Hugo Spiers. The cognitive map in humans: Spatial navigation and beyond. Nature Neuroscience, 20:1504–1513, 10 2017.
 (2) Pulkit Agrawal, João Carreira, and Jitendra Malik. Learning to see by moving. CoRR, abs/1505.01596, 2015.
 (3) Christopher G. Atkeson. personal communication, February 2, 2019.
 (4) Simon Baker and Iain Matthews. Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision, 56(3):221–255, 2004.
 (5) Alla Brodski, Georg-Friedrich Paasch, Saskia Helbling, and Michael Wibral. The faces of predictive coding. Journal of Neuroscience, 35(24):8997–9006, 2015.

 (6) Rodney A Brooks. Intelligence without reason. In Proceedings of the 12th International Joint Conference on Artificial Intelligence, 1991.
 (7) Thomas Brox and Jitendra Malik. Object segmentation by long term analysis of point trajectories. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, ECCV, pages 282–295, 2010.
 (8) Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027, 2019.
 (9) Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An InformationRich 3D Model Repository. arXiv preprint arXiv:1512.03012, 2015.
 (10) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
 (11) Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. CoRR, abs/1505.05192, 2015.
 (12) Reinder Dorman and Raymond van Ee. 50 years of stereoblindness: Reconciliation of a continuum of disparity detectors with blindness for disparity in near or far depth. iPerception, 8(6):2041669517738542, 2017.
 (13) Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pages 1–16, 2017.

 (14) S. M. Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S. Morcos, Marta Garnelo, Avraham Ruderman, Andrei A. Rusu, Ivo Danihelka, Karol Gregor, David P. Reichert, Lars Buesing, Theophane Weber, Oriol Vinyals, Dan Rosenbaum, Neil Rabinowitz, Helen King, Chloe Hillier, Matt Botvinick, Daan Wierstra, Koray Kavukcuoglu, and Demis Hassabis. Neural scene representation and rendering. Science, 360(6394):1204–1210, 2018.
 (15) Katerina Fragkiadaki, Pablo Arbelaez, Panna Felsen, and Jitendra Malik. Learning to segment moving objects in videos. In CVPR, June 2015.
 (16) Erez Freud, Tzvi Ganel, Ilan Shelef, Maxim D. Hammer, Galia Avidan, and Marlene Behrmann. Three-dimensional representations of objects in dorsal cortex are dissociable from those in ventral cortex. Cerebral Cortex, 27(1):422–434, 2015.
 (17) Karl Friston. Learning and inference in the brain. Neural Networks, 16(9):1325–1352, 2003.
 (18) Dashan Gao, Vijay Mahadevan, and Nuno Vasconcelos. On the plausibility of the discriminant center-surround hypothesis for visual saliency. Journal of Vision, 8, 2008.
 (19) Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. International Journal of Robotics Research (IJRR), 2013.
 (20) James J. Gibson. The Ecological Approach to Visual Perception. Houghton Mifflin, 1979.
 (21) Abhinav Gupta, Adithyavairavan Murali, Dhiraj Gandhi, and Lerrel Pinto. Robot learning in homes: Improving generalization and reducing dataset bias. CoRR, abs/1807.07049, 2018.

 (22) Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2616–2625, 2017.
 (23) Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
 (24) Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. CoRR, abs/1703.06870, 2017.
 (25) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
 (26) J. F. Henriques and A. Vedaldi. MapNet: An allocentric spatial memory for mapping environments. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
 (27) Sepp Hochreiter and Jürgen Schmidhuber. Long shortterm memory. Neural computation, 9(8):1735–1780, 1997.
 (28) Berthold K. P. Horn and Brian G. Schunck. Determining optical flow. Artificial Intelligence, 17(1–3):185–203, 1981.
 (29) Michael Hornacek, Andrew Fitzgibbon, and Carsten Rother. SphereFlow: 6 DoF scene flow from RGBD pairs. In CVPR, 2014.
 (30) Jun-Ting Hsieh, Bingbin Liu, De-An Huang, Li Fei-Fei, and Juan Carlos Niebles. Learning to decompose and disentangle representations for video prediction. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 517–526. Curran Associates, Inc., 2018.
 (31) Dinesh Jayaraman and Kristen Grauman. Learning image representations equivariant to egomotion. CoRR, abs/1505.02206, 2015.
 (32) Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a multi-view stereo machine. CoRR, abs/1708.05375, 2017.
 (33) Anna Khoreva, Rodrigo Benenson, Eddy Ilg, Thomas Brox, and Bernt Schiele. Lucid data dreaming for object tracking. CoRR, abs/1703.09554, 2017.
 (34) Adam R. Kosiorek, Hyunjik Kim, Ingmar Posner, and Yee Whye Teh. Sequential attend, infer, repeat: Generative modelling of moving objects. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 8615–8625, 2018.
 (35) Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, pages 667–676, 2017.
 (36) Al Lehnen and Gary E Wesenberg. The sphere game in n dimensions. 2006.
 (37) Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. Endtoend training of deep visuomotor policies. CoRR, abs/1504.00702, 2015.

 (38) Chen-Hsuan Lin and Simon Lucey. Inverse compositional spatial transformer networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 (39) John C Loehlin. Latent Variable Models: An Introduction to Factor, Path, and Structural Analysis. Lawrence Erlbaum Associates, Inc., 1987.
 (40) Vijay Mahadevan and Nuno Vasconcelos. Spatiotemporal saliency in dynamic scenes. TPAMI, 32, 2010.
 (41) James L McClelland and David E Rumelhart. An interactive activation model of context effects in letter perception: I. an account of basic findings. Psychological review, 88(5):375, 1981.
 (42) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
 (43) Raul Mur-Artal and Juan D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
 (44) P. Ochs and T. Brox. Object segmentation in video: a hierarchical variational approach for turning point trajectories into dense regions. In ICCV, 2011.
 (45) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
 (46) Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
 (47) Yair Pinto, Simon van Gaal, Floris P de Lange, Victor AF Lamme, and Anil K Seth. Expectations accelerate entry of visual stimuli into awareness. Journal of Vision, 15(8):13–13, 2015.
 (48) Rajesh PN Rao and Dana H Ballard. Predictive coding in the visual cortex: a functional interpretation of some extraclassical receptivefield effects. Nature neuroscience, 2(1):79, 1999.
 (49) Lawrence Roberts. Machine Perception of Three-Dimensional Solids. PhD thesis, MIT, 1965.
 (50) Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
 (51) Wolfram Schultz, Peter Dayan, and P Read Montague. A neural substrate of prediction and reward. Science, 275(5306):1593–1599, 1997.
 (52) Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhöfer. DeepVoxels: Learning persistent 3D feature embeddings. arXiv preprint arXiv:1812.01024, 2018.
 (53) Kihyuk Sohn. Improved deep metric learning with multi-class N-pair loss objective. In Advances in Neural Information Processing Systems, pages 1857–1865, 2016.
 (54) Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. CoRR, abs/1511.06452, 2015.
 (55) Kasey C Soska and Scott P Johnson. Development of threedimensional object completion in infancy. Child development, 79(5):1230–1236, 2008.
 (56) Elizabeth S Spelke, J Mehler, M Garrett, and E Walker. Perceptual knowledge of objects in infancy. In Perspectives on Mental Representation, chapter 22. Erlbaum, Hillsdale, NJ, 1982.
 (57) D. Sun, S. Roth, and M. J. Black. Secrets of optical flow estimation and their principles. In CVPR, 2010.
 (58) Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018.
 (59) Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Single-view to multi-view: Reconstructing unseen views with a convolutional network. CoRR, abs/1511.06702, 2015.
 (60) HsiaoYu Fish Tung, Ricson Cheng, and Katerina Fragkiadaki. Learning spatial common sense with geometryaware recurrent networks. arXiv:1901.00003, 2018.
 (61) Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
 (62) Carl Vondrick, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. Tracking emerges by colorizing videos. In Proceedings of the European Conference on Computer Vision (ECCV), pages 391–408, 2018.

 (63) Jacob Walker, Carl Doersch, Abhinav Gupta, and Martial Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In ECCV, 2016.
 (64) Matthew Wall and Andrew Smith. The representation of egomotion in the human brain. Current Biology, 18:191–194, 2008.
 (65) ChaoYuan Wu, R Manmatha, Alexander J Smola, and Philipp Krahenbuhl. Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2840–2848, 2017.
 (66) Jason J. Yu, Adam W. Harley, and Konstantinos G. Derpanis. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In ECCV, 2016.
Appendix A Architecture details for 3D-bottlenecked RNNs
Our model builds upon the geometry-aware recurrent neural networks (GRNNs) of Tung et al. commonsense. These models are RNNs with a 4D latent state $\mathbf{M}$, which has spatial resolution $w \times h \times d$ (width, height, and depth) and feature dimensionality $c$ (channels). At each time step, we estimate the rigid transformation between the current camera viewpoint and the coordinate system of the latent map, rotate and translate the features extracted from the current input view and depth map to align them with the coordinate system of the latent map, and convolutionally update the latent map using a standard convolutional 3D GRU, a 3D LSTM, or plain feature averaging. We found plain averaging to work well, while being much faster than a GRU or LSTM. We refer to the memory state as the model's imagination to emphasize that most of the grid points in $\mathbf{M}$ will not be observed by any sensor, and so their feature content must be "imagined" by the model. The imagination space corresponds to a large box of Euclidean world space, which we place in front of the camera's position at the first timestep, oriented parallel to the ground plane. We trim input pointclouds to a maximum of 100,000 points and to a range of 80 meters, to simulate a Velodyne sensor.
Below we present the individual modules in detail. The full set of modules allows the model to differentiably go back and forth between 2D pixel observation space and 3D imagination space.
2D-to-3D unprojection

This module converts the input 2D image $I$ and depth map $D$ into a 4D tensor $\mathbf{U}$, by filling the 3D imagination grid with samples from the 2D image grid using perspective (un)projection. Specifically, for each cell in the imagination grid, indexed by the coordinate $(i, j, k)$, we compute the floating-point 2D pixel location $[u, v]^\top = \pi(K S [i, j, k]^\top)$ that it projects to from the current camera viewpoint, using the pinhole camera model hartley2003multiple, where $S$ is the similarity transform that converts memory coordinates to camera coordinates, $K$ is the camera intrinsics matrix (transforming camera coordinates to pixel coordinates), and $\pi$ denotes perspective division. We fill $\mathbf{U}[i, j, k]$ with the bilinearly interpolated pixel value $I[u, v]$. We transform the depth map in a similar way to obtain a binary occupancy grid $\mathbf{O}$, assigning each voxel a value of 1 or 0 depending on whether or not a point lands in it. We concatenate this to the unprojected RGB, making the tensor $[\mathbf{U}, \mathbf{O}]$. A sketch of this unprojection appears below.

In order to learn a geometrically consistent 3D feature map, our model needs to register all unprojected tensors over time by cancelling the relative 3D rotation and translation between the camera viewpoints. We denote registered tensors with the subscript reg. We treat the first camera position as the reference system, thus $\mathbf{U}_0^{reg} = \mathbf{U}_0$ (and $\mathbf{O}_0^{reg} = \mathbf{O}_0$). On later timesteps (assuming the camera moves), we obtain the registered tensors using a modified sampling equation, $[u, v]^\top = \pi(K T_t S [i, j, k]^\top)$, where $T_t$ is the camera pose at timestep $t$ (transforming reference-camera coordinates to current-camera coordinates). In this work, we assume access to ground-truth camera pose information at training time, since an active observer does have access to its approximate egomotion; at test time, we estimate the egomotion. Unlike the GRNN of Tung et al. commonsense, our GRNN's camera is not restricted to 2D motion along a viewing sphere; instead, we consider a general camera with full 6-DoF motion.
Latent map recurrent update

This module aggregates egomotion-stabilized (registered) feature tensors into the memory tensor $\mathbf{M}_t$. We first pass the registered tensors through a series of 3D convolution layers, producing a 3D feature tensor for the timestep, denoted $\mathbf{S}_t$. On the first timestep, we set $\mathbf{M}_0 = \mathbf{S}_0$. On later timesteps, the memory update is computed using a running average operation; 3D convolutional LSTM or 3D convolutional GRU updates could be used instead.

The 3D feature encoder-decoder has the following architecture, using the notation kernel-stride-channels: 4-2-64, 4-2-128, 4-2-256, 4-0.5-128, 4-0.5-64, 1-1-$c$, where $c$ is the feature dimension. After each deconvolution (stride-0.5 layer) in the decoder, we concatenate the same-resolution feature map from the encoder. Every convolution layer (except the last in each net) is followed by a leaky ReLU activation and batch normalization.
Egomotion estimation

This module computes the relative 3D rotation and translation between the current camera viewpoint and the reference coordinate system of the map $\mathbf{M}$. We significantly changed the module of Tung et al. commonsense, which could only handle 2 degrees of camera motion. Our egomotion module is inspired by the state-of-the-art PWC-Net optical flow method sun2018pwc: it incorporates spatial pyramids, incremental warping, and cost volume estimation via cross-correlation.

While the unprojected inputs could be used directly as input to the egomotion module, we find better performance can be obtained by allowing the egomotion module to learn its own feature space. Thus, we begin by passing the (unregistered) 3D inputs through a series of 3D convolution layers, producing a reference tensor $\mathbf{f}_{ref}$ and a query tensor $\mathbf{f}_q$. We wish to find the rigid transformation that aligns the two.

We use a coarse-to-fine architecture, which estimates a coarse 6D answer at the coarsest scale and refines this answer at finer scales. We iterate across scales in the following manner. First, we downsample both feature tensors to the target scale (unless we are at the finest scale). Then, we generate several 3D rotations of the second tensor, representing "candidate rotations", making a set $\{\mathbf{f}_q^r \mid r \in \mathcal{R}\}$, where $\mathcal{R}$ is the discrete set of 3D rotations considered. We then use 3D axis-aligned cross-correlations between $\mathbf{f}_{ref}$ and each $\mathbf{f}_q^r$, which yields a cost volume with one entry per rotation, spatial position, and translation offset explored by the cross-correlation. We average across the spatial dimensions, yielding an average alignment score for each candidate transform, and apply a small fully-connected network to convert these scores into a 6D pose vector. We then warp $\mathbf{f}_q$ according to the rigid transform specified by the 6D vector, to bring it into (closer) alignment with $\mathbf{f}_{ref}$. We repeat this process at each scale, accumulating increasingly fine corrections to the initial 6D vector.

Similar to PWC-Net sun2018pwc, since we compute egomotion in a coarse-to-fine manner, we need only consider a small set of rotations and translations at each scale (when generating the cost volumes); the final transform composes all incremental transforms together. However, unlike PWC-Net, we do not repeatedly warp our input tensors, because this accumulates interpolation error. Instead, following the inverse compositional Lucas-Kanade algorithm baker2004lucas ; lin2017inverse, we warp the original input tensor with the composed transform at each scale. A simplified sketch of the rotation-scoring step appears below.
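As an illustration, here is a single-scale SciPy sketch of scoring candidate rotations (the real module uses learned features, multiple scales, and a joint rotation-translation cost volume followed by a fully-connected network):

```python
import numpy as np
from scipy.ndimage import affine_transform

def rotation_scores(f_ref, f_q, candidate_rots):
    """Score candidate rotations by correlating the rotated query
    features against the reference features (a simplified sketch).

    f_ref, f_q:     (w, h, d, c) feature tensors.
    candidate_rots: list of 3x3 rotation matrices.
    """
    w, h, d, _ = f_q.shape
    center = np.array([w, h, d]) / 2.0
    scores = []
    for R in candidate_rots:
        # Rotate f_q about the grid center; affine_transform maps output
        # coordinates to input coordinates, hence the inverse rotation.
        offset = center - R.T @ center
        rot = np.stack([affine_transform(f_q[..., ch], R.T, offset=offset)
                        for ch in range(f_q.shape[-1])], axis=-1)
        scores.append((f_ref * rot).mean())   # mean alignment score
    return np.array(scores)                   # argmax = best rotation
```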
After solving for the relative pose, we generate the registered 3D tensors by re-unprojecting the raw inputs (using the solved pose during unprojection). Note that an alternative registration method is to bilinearly warp the unregistered tensors using the relative pose (as done in Tung et al. commonsense), but we find that this introduces noticeable and unnecessary interpolation error.
3D-to-2D projection

This module "renders" 2D feature maps from a desired viewpoint by projecting the 3D feature state $\mathbf{M}$. We first orient the state map by resampling the 3D feature map into a view-aligned version $\mathbf{M}^v$. The sampling is defined by $[i, j, k]^\top = S^{-1} V^{-1} S [i', j', k']^\top$, where $S$ is (as before) the similarity transform that brings imagination coordinates into camera coordinates, $V$ is the transformation that relates the reference camera coordinates to the viewing camera coordinates, $(i, j, k)$ are voxel indices in $\mathbf{M}$, and $(i', j', k')$ are voxel indices in $\mathbf{M}^v$. We then warp the view-oriented tensor such that perspective viewing rays become axis-aligned. We implement this by sampling from the memory tensor at the 3D points $r K^{-1} [u, v, 1]^\top$, where the indices $(u, v)$ span the image we wish to generate, and $r$ spans the length of each ray. We use logarithmic spacing for the increments of $r$, finding it far more effective than linear spacing (used in prior work commonsense), likely because our scenes cover a large metric space. We call the perspective-transformed tensor $\mathbf{M}^{proj}$. To avoid repeated interpolation, we compose the view transform with the perspective transform, and compute $\mathbf{M}^{proj}$ from $\mathbf{M}$ with a single trilinear sampling step. Finally, we pass the perspective-transformed tensor through a series of 2D convolutional layers, converting it to a 2D feature map; we denote the final output of this 3D-to-2D process as $c^{2D}$. Note that the previous work of Tung et al. commonsense decodes directly to an RGB image using an LSTM residual decoder; we use their model as our baseline in the experimental section.

The view renderer has the following architecture, using the same kernel-stride-channels notation as above: max-pool 8-8 along the depth axis, 3-1-32 (3D conv), reshape from 4D to 3D, 3-1-32 (2D conv), 1-1-$e$ (2D conv), where $e$ is the embedding/channel dimension. For predicting RGB, the output dimension is 3; for metric learning, we use a larger embedding dimension, finding that with too low a dimensionality the model underfits. A sketch of the logarithmic ray sampling follows.
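A minimal sketch of the logarithmic depth sampling along each viewing ray (the near/far bounds here are illustrative placeholders, not the paper's values):

```python
import numpy as np

def ray_sample_depths(near, far, n_samples):
    """Logarithmically spaced depths along a viewing ray, as used when
    resampling the memory into a perspective-aligned tensor."""
    return np.exp(np.linspace(np.log(near), np.log(far), n_samples))

# e.g., ray_sample_depths(1.0, 32.0, 32) spends more samples near the
# camera, where a fixed pixel footprint covers less metric space.
```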
3D imagination FlowNet

To train our 3D FlowNet, we generate supervised labels from synthetic transformations of the input, and combine these with an unsupervised loss based on the standard variational loss horn1981determining ; back_to_basics:2016. For the synthetic transformations, we randomly sample from three uniform distributions of rigid transformations: (i) large motions, with rotation angles (in degrees) and translations (in meters) drawn from a wide range, (ii) small motions, with angles and translations drawn from a narrow range around zero, and (iii) zero motion. We found that without sampling the (additional) small and zero motions, the model does not accurately learn these ranges. Still, since these synthetic transformations cause the entire tensor to move at once, a FlowNet learned from this supervision alone tends to produce overly smooth flow in scenes with real (non-rigid) motion. The variational loss, described next, overcomes this issue.

For the variational loss horn1981determining ; back_to_basics:2016, we use a pair of consecutive frames with motion between them (including camera and object motion), estimate the flow, and backwarp the second tensor to align with the first. Compared to a standard optical flow method, we apply this loss in 3D with voxel features, rather than in 2D with image pixels:
$$\mathcal{L}_{warp} = \left\| \mathbf{M}_t - \mathcal{W}(\mathbf{M}_{t+1}, \mathbf{F}) \right\|_1, \qquad (3)$$

where $\mathbf{M}_t$ is the memory tensor, $\mathbf{F}$ is the estimated 3D flow field, and $\mathcal{W}(\mathbf{M}_{t+1}, \mathbf{F})$ is the inverse-warped tensor from the next timestep. We apply the warp with a differentiable 3D spatial transformer layer, which does trilinear interpolation to resample each voxel. Note that this loss makes an assumption of voxel feature constancy (instead of the traditional pixel brightness constancy)—and this assumption is made true across views by our metric learning loss.
We also apply a smoothness loss penalizing local flow changes, $\mathcal{L}_{smooth} = \|\nabla \mathbf{F}\|_1$, where $\mathbf{F}$ is the estimated flow field and $\nabla$ is the 3D spatial gradient. This is a standard technique to prevent the model from learning flow only at motion edges horn1981determining ; back_to_basics:2016.
Unlike 2D flow methods, our unsupervised 3D flow does not require special terms to deal with occlusions and disocclusions, since 3D flow is defined everywhere inside the grid.
Note that the flow of free-space voxels is arbitrary. It is for this reason that we element-wise multiply the flow grid by the occupancy grid before attempting object discovery. From another perspective, this multiplication ensures that we only discover solid objects. A sketch of the warping and smoothness losses follows.
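A minimal PyTorch sketch of Eq. (3) plus the smoothness term (the paper's implementation is TensorFlow with a custom trilinear-resampling kernel; torch's grid_sample stands in here for the 3D spatial transformer):

```python
import torch
import torch.nn.functional as F

def flow_losses(M_t, M_tp1, flow):
    """Warping loss of Eq. (3) plus first-order smoothness (a sketch).

    M_t, M_tp1: (1, c, D, H, W) consecutive registered latent maps.
    flow:       (1, 3, D, H, W) estimated 3D flow in voxel units,
                channel-ordered (dz, dy, dx) to match the tensor axes.
    """
    _, _, D, H, W = M_t.shape
    zz, yy, xx = torch.meshgrid(torch.arange(D), torch.arange(H),
                                torch.arange(W), indexing="ij")
    base = torch.stack([zz, yy, xx], dim=0).float().unsqueeze(0)
    tgt = base + flow                         # where each voxel moved to
    # Normalize to [-1, 1] and reorder to (x, y, z) for grid_sample.
    norm = lambda v, s: 2.0 * v / (s - 1) - 1.0
    grid = torch.stack([norm(tgt[:, 2], W), norm(tgt[:, 1], H),
                        norm(tgt[:, 0], D)], dim=-1)  # (1, D, H, W, 3)
    warped = F.grid_sample(M_tp1, grid, align_corners=True)  # backwarp
    l_warp = (M_t - warped).abs().mean()      # 3D feature constancy
    # L1 of the spatial gradient of the flow, along each grid axis.
    l_smooth = ((flow[:, :, 1:] - flow[:, :, :-1]).abs().mean()
                + (flow[:, :, :, 1:] - flow[:, :, :, :-1]).abs().mean()
                + (flow[:, :, :, :, 1:] - flow[:, :, :, :, :-1]).abs().mean())
    return l_warp, l_smooth
```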
3D occupancy estimation
The goal in this step is to estimate which voxels in the imagination grid are “occupied” (i.e., have something visible inside) and which are “free” (i.e., have nothing visible inside).
For supervision, we obtain (partial) labels for both "free" and "occupied" voxels from the input depth data. Sparse "occupied" voxel labels are given by the voxelized pointcloud. To obtain labels of "free" voxels, we trace the source-camera ray to each observed occupied voxel, and mark all voxels intersected by this ray as "free", as sketched below.
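A readable (unvectorized) NumPy sketch of this label generation, assuming the points and the camera center are already expressed in voxel coordinates:

```python
import numpy as np

def occupancy_labels(grid_shape, points_mem, camera_mem, n_steps=64):
    """Generate partial occupancy labels from a depth pointcloud (a sketch).

    points_mem: (N, 3) observed 3D points in memory/voxel coordinates.
    camera_mem: (3,) camera center in memory/voxel coordinates.
    Returns labels (1 = occupied, 0 = free) and a validity mask.
    """
    labels = np.zeros(grid_shape, dtype=np.float32)
    valid = np.zeros(grid_shape, dtype=bool)
    for p in points_mem:
        # March from the camera toward the point; every voxel crossed
        # before the endpoint is visibly empty ("free").
        for a in np.linspace(0.0, 1.0, n_steps, endpoint=False):
            q = np.floor(camera_mem + a * (p - camera_mem)).astype(int)
            if np.all(q >= 0) and np.all(q < grid_shape):
                valid[tuple(q)] = True        # labeled "free" (stays 0)
        e = np.floor(p).astype(int)
        if np.all(e >= 0) and np.all(e < grid_shape):
            labels[tuple(e)] = 1.0            # endpoint is "occupied"
            valid[tuple(e)] = True
    return labels, valid
```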
Our occupancy module takes the memory tensor $\mathbf{M}$ as input and produces a new tensor $\mathbf{o}$, with a value in $[0, 1]$ at each voxel, representing the probability of the voxel being occupied. This is achieved by a single 3D convolution layer with a $1 \times 1 \times 1$ filter (or, equivalently, a fully-connected network applied at each grid location), followed by a sigmoid nonlinearity. We train this network with the logistic loss,

$$\mathcal{L}_{occ} = -\sum_{i} \mathbb{1}_i \left( \ell_i \log o_i + (1 - \ell_i) \log(1 - o_i) \right), \qquad (4)$$
where $\ell$ is the label map and $\mathbb{1}$ is an indicator tensor marking which labels are valid. Since there are far more "free" voxels than "occupied" ones, we balance this loss across classes within each minibatch, as in the sketch below.
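A sketch of Eq. (4) with the per-minibatch class balancing (operating on pre-sigmoid logits is our choice here, for numerical stability):

```python
import torch

def balanced_occupancy_loss(logits, labels, valid):
    """Class-balanced logistic loss for occupancy, Eq. (4) (a sketch).

    logits, labels, valid: (w, h, d) tensors; valid masks labeled voxels.
    """
    bce = torch.nn.functional.binary_cross_entropy_with_logits(
        logits, labels, reduction="none") * valid
    pos = valid * labels            # "occupied" labels
    neg = valid * (1 - labels)      # "free" labels (far more numerous)
    # Average each class separately so "free" voxels do not dominate.
    loss_pos = (bce * pos).sum() / pos.sum().clamp(min=1)
    loss_neg = (bce * neg).sum() / neg.sum().clamp(min=1)
    return 0.5 * (loss_pos + loss_neg)
```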
Bottom-up network

The "bottom-up" embedding net is a residual network DBLP:journals/corr/HeZRS15 with two residual blocks, with two convolutions and one transposed convolution in each block. The convolution layers' channel dimensions are 64, 64, 64, 128, 128, 128. Finally, there is one convolution layer with $e$ channels, where $e$ is the embedding dimension (matching the output of the 3D-to-2D renderer).
Contrastive loss
For both the 2D and the 3D contrastive loss, for each example in the minibatch, we randomly sample a set of 960 pixel/voxel coordinates for supervision. Each coordinate $i$ gives a positive correspondence $(c_i, b_i)$, since the tensors are aligned. For each positive, we sample a negative from the samples acquired across the entire batch, using the distance-weighted sampling strategy of Wu et al. wu2017sampling. In this way, on every iteration we obtain an equal number of positive and negative samples, where the negative samples are spread out in distance; a sketch of this sampling appears below. We additionally apply an L1 loss on the difference between the entire tensors, which penalizes distance at all positive correspondences (instead of merely the ones sampled for the metric loss); we find this accelerates training. We use separate weighting coefficients for $\mathcal{L}_{2D}$, $\mathcal{L}_{3D}$, and the L1 losses.
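A sketch of the distance-weighted negative sampling of Wu et al. (assuming unit-normalized embeddings; excluding an anchor's own positive from the pool is omitted for brevity):

```python
import torch

def distance_weighted_negatives(anchors, pool, n_dims, cutoff=0.5):
    """Sample one negative per anchor, weighted inversely to the
    spherical distance density q(d) ~ d^(n-2) * (1 - d^2/4)^((n-3)/2).

    anchors: (A, e) anchor embeddings; pool: (P, e) candidate negatives.
    """
    d = torch.cdist(anchors, pool).clamp(min=cutoff, max=1.99)
    log_q = ((n_dims - 2) * torch.log(d)
             + 0.5 * (n_dims - 3) * torch.log(1 - 0.25 * d ** 2))
    w = torch.exp(-log_q)                     # inverse of the density
    w = w / w.sum(dim=1, keepdim=True)        # per-anchor distribution
    idx = torch.multinomial(w, 1).squeeze(1)  # one negative per anchor
    return pool[idx]
```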
A.1 Code and training details
Our model is implemented in Python/TensorFlow, with custom CUDA kernels for the 3D cross-correlation (used in the egomotion module and the flow module) and trilinear resampling (used in the 2D-to-3D and 3D-to-2D modules). The CUDA operations use less memory than the native TensorFlow equivalents, which facilitates training with large imagination tensors.

We train our models using a batch size of either 2 or 4, depending on the number of active modules. This is approximately the memory limit of a 12GB Titan X (Pascal) GPU, which is what we use to train our model. Training to convergence (approximately 200k iterations) takes approximately 48 hours. We use a learning rate of 0.001 for all modules except the 3D FlowNet, for which we use 0.0001. We use the Adam optimizer with its standard momentum parameters.
Our code and data will be made publicly available upon publication.
Appendix B Additional experiments
B.1 Datasets
CARLA vs. other datasets
We test our method on scenes we collected from the CARLA simulator Dosovitskiy17, an open-source driving simulator of urban scenes. CARLA permits moving the camera to any desired viewpoint in the scene, which is necessary for our view-based learning strategy. Previous view prediction works have considered highly synthetic datasets: the work of Eslami et al. Eslami1204 introduced the Shepard-Metzler dataset, which consists of seven colored cubes stuck together in random arrangements, and the Rooms-Ring-Camera dataset, which consists of random floor and wall colors and textures, with variable numbers of shapes of different geometries and colors in each room. The work of Tung et al. commonsense introduced a ShapeNet-arrangements dataset, which consists of table arrangements of synthetic ShapeNet models shapenet2015. The work of DBLP:journals/corr/TatarchenkoDB15 considers scenes with a single car. Such highly synthetic, limited-complexity datasets cast doubt on the usefulness and generality of view prediction for visual feature learning. The CARLA environments considered in this work have photorealistic rendering and depict diverse weather conditions, shadows, and objects, and are arguably much closer to real-world conditions, as shown in Figure 5. While there exist real-world datasets that are visually similar Geiger2013IJRR ; caesar2019nuscenes, they only contain a small number of viewpoints, which makes view-predictive training inapplicable.
CARLA data collection
We obtain data from the simulator as follows. We generate 1170 autopilot episodes of 50 frames each (at 30 FPS), spanning all weather conditions and all locations in both "cities" in the simulator. We define 36 viewpoints placed regularly along a 20m-radius hemisphere in front of the ego-car. This hemisphere is anchored to the ego-car (i.e., it moves with the car). In each episode, we sample 6 random viewpoints from the 36 and randomly perturb their pose, and then capture each timestep of the episode from these 6 viewpoints. We generate train/test examples from this by assembling all combinations of viewpoints (e.g., a subset of viewpoints as input, and an unseen viewpoint as the target). We filter out frames that have zero objects within the metric "in bounds" region of the GRNN. This yields 172,524 examples: 124,256 in City1 and 48,268 in City2. We treat the City1 data as the training set and the City2 data as the test set, so there is no overlap between the train and test images.
Table 4: 2D and 3D patch retrieval precision (P@k).

Method                                            2D patch retrieval       3D patch retrieval
                                                  P@1    P@5    P@10       P@1    P@5    P@10

2D-bottlenecked generative architectures
  GQN with 2D L1 loss Eslami1204                  .00    .01    .02        —      —      —

3D-bottlenecked generative architectures
  GRNN with 2D L1 loss commonsense                .00    .00    .00        .42    .57    .60
  GRNN-VAE with 2D L1 loss & KL div.              .00    .00    .01        .34    .58    .66

3D-bottlenecked contrastive architectures
  GRNN with 2D contrastive loss                   .14    .29    .39        .52    .73    .79
  GRNN with 2D & 3D contrastive losses            .20    .39    .47        .80    .97    .99
B.2 2D and 3D correspondence
We evaluate our model's performance in estimating visual correspondences in 2D and in 3D, using a nearest-neighbor retrieval task. In 2D, the task is as follows: we extract one "query" patch from a top-down render of a viewpoint, then extract 1000 candidate patches from bottom-up renders, with only one true correspondence (i.e., 999 negatives). We then rank all bottom-up patches according to L2 distance from the query, and report the retrieval precision at several ranks, averaged over 1000 queries. In 3D the task is similar, but the patches are feature cubes extracted from the 3D imagination; we generate queries from one viewpoint, and retrievals from other viewpoints and other scenes. The 1000 samples are generated as follows: from 100 random test examples, we generate 10 samples each, so that each sample has 9 negatives from the same viewpoint, and 990 others from different locations/scenes.
We compare the proposed model against (i) the RGB prediction baseline of Tung et al. commonsense, (ii) the Generative Query Networks (GQN) of Eslami1204, which lack a 3D representation bottleneck, and (iii) a VAE alternative of the (deterministic) model of Tung et al. commonsense.
Quantitative results are shown in Table 4. For 2D correspondence, the models trained with RGB prediction objectives obtain precision near zero at each rank, illustrating that these models do not learn precise RGB predictions. The proposed view-contrastive losses perform better, and combining the 2D and 3D contrastive losses is better than using the 2D loss alone. Interestingly, for 3D correspondence, the retrieval accuracy of the RGB-based models is relatively high. Training 3D-bottlenecked RNNs as a variational autoencoder, where stochasticity is added in the 3D bottleneck, improves precision at lower ranks. Contrastive learning outperforms all baselines, and adding the 3D contrastive loss gives a large boost over using the 2D contrastive loss alone. Note that 2D-bottlenecked architectures Eslami1204 cannot perform 3D patch retrieval. Qualitative retrieval results for our full model vs. Tung et al. commonsense are shown in Figure 6.
B.3 3D motion estimation
The error metrics presented in the main paper are in voxel units; these can be converted to meters by multiplying by the voxel-to-meter ratio, which is 4. For this evaluation (and for training) we subsample every 6th frame, to obtain large motions. Sample flows are visualized against ground truth in Figure 7.
B.4 Unsupervised 3D object motion segmentation
Our method proposes 3D object segmentations, but labels are only available in the form of oriented 3D boxes; we therefore convert our segmentations into boxes by fitting a minimum-volume oriented 3D box to each segment. The precision-recall curves presented in the main paper are computed at a fixed intersection-over-union (IOU) threshold. Figure 8 shows sample visualizations of the 3D box proposals projected onto the input images.
B.5 Semi-supervised object detection
The 3D Faster R-CNN, like its 2D counterpart, outputs axis-aligned boxes. We obtain axis-aligned 3D ground truth from the CARLA simulator. This is unlike the self-supervised setting, where we produce oriented boxes (from motion segmentation) and evaluate against oriented ground truth.
B.6 Occupancy prediction
We test our model's ability to estimate occupied and free space. Given a single view as input, the model outputs an occupancy probability for each voxel in the scene. Then, given the aggregated labels computed from this view and a random next view, we compute accuracy at all voxels for which we have labels. Voxels that are not intersected by either view's camera rays are left unlabelled. Table 5 shows the classification accuracy, evaluated independently for free and occupied voxels, and for all voxels aggregated together. Overall, accuracy is extremely high (97–98%) for both voxel types. Note that part of the occupied space (i.e., the voxelized pointcloud of the first frame) is an input to the network, so accuracy on this metric is expected to be high.
Table 5: Occupancy classification accuracy.

3D area type      Classification accuracy

Occupied space    .97
Free space        .98
All space         .98
We show a visualization of the occupancy grids in Figure 8 (right). We visualize an occupancy grid by converting it to a heightmap: we multiply each voxel's occupancy value by its height coordinate in the grid, then take a max along the grid's height axis, as sketched below. The visualizations show that the occupancy module learns to fill the "holes" of the partial view, effectively imagining the complete 3D scene.
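A one-line NumPy sketch of this heightmap visualization:

```python
import numpy as np

def occupancy_heightmap(occ):
    """Convert an occupancy grid to a heightmap image (a sketch).

    occ: (w, h, d) occupancy probabilities; axis 1 assumed to be height.
    """
    heights = np.arange(occ.shape[1]).reshape(1, -1, 1)
    return (occ * heights).max(axis=1)   # (w, d) heightmap
```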