Learning Spatial Common Sense with Geometry-Aware Recurrent Networks

by   Hsiao-Yu Fish Tung, et al.
Carnegie Mellon University

We integrate two powerful ideas, geometry and deep visual representation learning, into recurrent network architectures for mobile visual scene understanding. The proposed networks learn to "lift" 2D visual features and integrate them over time into latent 3D feature maps of the scene. They are equipped with differentiable geometric operations, such as projection, unprojection, egomotion estimation and stabilization, in order to compute a geometrically-consistent mapping between the world scene and their 3D latent feature space. We train the proposed architectures to predict novel image views given short frame sequences as input. Their predictions strongly generalize to scenes with a novel number of objects, appearances and configurations, and greatly outperform predictions of previous works that do not consider egomotion stabilization or a space-aware latent feature space. We train the proposed architectures to detect and segment objects in 3D, using the latent 3D feature map as input--as opposed to any input 2D video frame. The resulting detections are permanent: they continue to exist even when an object gets occluded or leaves the field of view. Our experiments suggest the proposed space-aware latent feature arrangement and egomotion-stabilized convolutions are essential architectural choices for spatial common sense to emerge in artificial embodied visual agents.






1 Introduction

Figure 1: Internet vision versus robotic vision. Pictures taken by humans (top row) (and uploaded on the web) are the output of visual perception of a well-trained agent, the human photographer. The content is skillfully framed and the objects appear in canonical scales and poses. Pictures taken by mobile agents, such as a NAO robot during a robot soccer game (bottom row), are the input to such visual perception. The objects are often partially occluded and appear in a wide variety of locations, scales and poses. We present recurrent neural architectures for the latter, that integrate visual information over time to piece together the visual story of the scene.

Current state-of-the-art visual systems [11] accurately detect object categories that are rare and unfamiliar to many of us, such as gyromitra, a particular genus of mushroom (Figure 1, top left). Yet they neglect the basic principles of object permanence and spatial awareness that a one-year-old child has already developed: once the camera turns away, or a person walks in front of the gyromitra, its detection disappears and is replaced by the objects detected in the new visual frame. We believe that the ability of current visual systems to detect rare and exquisite object categories, alongside their inability to carry out elementary spatial reasoning, stems from the fact that they are trained to label object categories in static Internet photos (contained in the ImageNet and COCO datasets) using a single frame as input.

Our overexposure to Internet photos makes us forget how pictures captured by mobile agents look. Consider Figure 1. Internet photos are skillfully captured by human photographers; they are well framed and show objects unoccluded, in canonical locations, scales and poses (top row). In contrast, photos captured by NAO robots during a soccer game show objects in a wide variety of scales, poses, locations and occlusion configurations (bottom row). Often it would not even make sense to label objects in such images, as most objects appear only half-visible. In the case of Internet vision, the picture is the output of visual perception of a well-trained visual agent, the human photographer. In the case of mobile robotic vision, the picture is the input to such visual perception. Different architectures may therefore be needed for each.

Figure 2: Geometry-aware Recurrent Neural Networks (GRNNs) integrate visual information over time in a 3D geometrically-consistent GRU memory of the visual scene. At each frame, RGB images are unprojected into corresponding 3D feature tensors, which are oriented to the coordinate frame of the memory map built thus far (2nd row). A 3D convolutional GRU memory is then updated using the egomotion-stabilized features as input.

We present Geometry-aware Recurrent Neural Network architectures, which we call GRNNs, that learn to "lift" 2D image features into 3D feature maps of the scene while stabilizing against the egomotion of the agent. They are equipped with a 3-dimensional latent feature state: the latent feature vectors are arranged in a 3D grid, where every location of the grid encodes a 3D physical location in the scene. The latent state is updated with each new input frame using egomotion-stabilized convolutions, as shown in Figure 2. GRNNs learn to map 2D input visual features to a 3D latent feature map, and back, in a differentiable manner. To achieve such a differentiable and geometrically-consistent mapping between the world scene and the 3D latent feature space, they are equipped with differentiable geometric operations, such as egomotion estimation and feature stabilization, 3D-to-2D projection, and 2D-to-3D unprojection, as shown in Figure 2. Beyond being space-aware, we do not impose any other constraints on the learned representations: they are free to encode whatever is relevant for the downstream task.

We train GRNNs in a self-supervised manner to predict image views from novel camera viewpoints, given short frame sequences as inputs. We empirically show that GRNNs learn to predict novel views and strongly generalize to novel scenes with a different number, appearance and configuration of objects. They greatly outperform geometry-unaware networks of previous works that are trained under the exact same view-prediction loss but do not use egomotion-stabilized convolutions or a 3D latent space. We argue that strong generalization is a necessary condition for claiming the ability to spatially reason. Furthermore, the resulting representations support scene arithmetic: adding/subtracting latent scene representations, and decoding the resulting representation from a particular viewpoint, matches the result of adding/subtracting the 3D world scenes directly.

We train GRNNs in a supervised manner to localize objects in the scene, given short frame sequences as inputs. We use the latent 3D feature map as input to a region proposal and segmentation network that predicts 3D bounding boxes and binary voxel object occupancies, as shown in Figure 2. For this we adapted the well-known 2D object detector and segmentor MaskRCNN [11] to operate in 3D. The resulting object detections and segmentations persist in time: they do not suffer from instantaneous occlusions, dis-occlusions, or changes of viewpoint; an object that is not visible from the current viewpoint is still present in the latent feature map. By projecting the detected 3D objects into 2D views we obtain amodal [19] object boxes and segments, even under severe occlusions. Our code and datasets will be made publicly available.

2 Related Work

Deep geometry

Simultaneous Localization and Mapping (SLAM) [22, 15] methods are purely geometric: they build a 3D pointcloud map of the scene while estimating the motion of the observer. Our method instead builds deep feature maps, which capture both the geometry and the semantics of the scene. Recently, there has been great interest in integrating learning and geometry for single-view 3D object reconstruction [24, 27], 3D object reconstruction from videos [17], depth and egomotion estimation from pairs of frames [26, 30], depth estimation from stereo images [8], and estimation of 3D human keypoints from 2D keypoint heatmaps [28, 25]. Many of these works use neural network architectures equipped with some form of differentiable camera projection, so that the desired 3D estimates can be supervised directly using 2D quantities. They do not, however, accumulate information across multiple camera viewpoints. For example, Tulsiani et al. [24], Wu et al. [27] and Zhou et al. [30] use a single image frame as input to predict a 3D reconstruction for a single object, or a 2D depth map for the entire scene, though they use multiple views to obtain supervision in the form of a depth re-projection error. Learnt stereo machines (LSM) [14] integrate RGB information along sequences of random camera viewpoints into a latent 3D feature memory tensor, in an egomotion-stabilized way similar to ours. Their goal, however, is to 3D-reconstruct a single object, as opposed to detecting and segmenting (3D-reconstructing) multiple objects in the scene, as our model does. They also assume egomotion is given, while we estimate it. Moreover, LSM can only be trained with supervision for the object 3D reconstruction task, while GRNNs can be trained self-supervised for view prediction. The work on LSM has, though, inspired the models proposed in this paper.

MapNet [12], Cognitive mapping and planning [10], IQA [9] and Neural Map [18] construct 2D overhead maps of the scene by taking into account the egomotion of the observer, similar to our method. MapNet further estimates the egomotion, as opposed to assuming it is known, as the other methods do. In IQA, objects are detected in each frame and the detections are aggregated in the birdview map, whereas we detect objects using the 3D feature map as input.

The closest work to ours is arguably that of Cheng et al. [3], which considers egomotion-stabilized convolutions and a 3D latent map, like ours, for segmenting objects in 3D. However, they assume egomotion is known, while we learn to estimate it, and their object detection pipeline uses heuristics to determine the number of objects in the scene, by discretizing continuous voxel segmentation embeddings obtained with metric learning; we instead train 3D region proposal networks. Most importantly, they do not consider self-supervised learning via view prediction, which is one of the central contributions of this work; rather, they focus exclusively on supervised voxel labelling using groundtruth 3D voxel occupancies provided by a simulator.

Self-supervised visual feature learning

Researchers have considered many self-supervised tasks to train visual representations without human labels. For example, the works of [13, 1] train visual representations by predicting egomotion between consecutive frames, and the works of [6, 23] predict novel views of a scene. In particular, the authors of the generative query network (GQN) [6] argue that GQN learns to disentangle color, lighting, shapes and spatial arrangement without any human labels. We compare against their model in Section 4 and show that GRNNs can strongly generalize beyond the training set, while GQN cannot. Such strong generalization suggests that a 3D latent space and egomotion stabilization are necessary architectural choices for spatial reasoning to emerge.

3D object detection

When LiDAR input is available, many recent works attempt to detect objects directly in 3D [31, 16, 29] from LiDAR and RGB streams captured from a self-driving car. They mostly use a single frame as input, while the proposed GRNNs integrate visual information over time. Furthermore, GRNNs are trained to recover spatial common sense in self-supervised ways, which is not the case for the supervised 3D object detectors of [31, 16, 29]. Extending GRNNs to scenes with independently moving objects is a clear avenue for future work.

3 Geometry-aware recurrent networks

The proposed GRNNs are recurrent neural networks whose latent state learns a 3D deep feature map of the visual scene. We use the terms 4D tensor and 3D feature map interchangeably, to denote a set of feature channels, each being 3-dimensional. The memory map is updated with each new camera view in a geometrically-consistent manner, so that information from 2D pixel projections that correspond to the same 3D physical point ends up nearby in the memory tensor, as illustrated in Figure 3. This permits later convolutional operations to see corresponding input across frames, rather than input that varies with the motion of the observer. We believe this is key to generalization. The main components of GRNNs are illustrated in Figure 3 and detailed below.

Figure 3: Neural components of GRNNs. RGB images are fed into a 2D U-net, the resulting deep features are unprojected to 4D tensors, fed into a 3D U-net, oriented to cancel the camera motion with respect to the 3D GRU memory state built thus far, and then used to update the 3D GRU memory state. The memory map is projected to specific viewpoints and decoded into a corresponding RGB image, or fed into a 3D detector/segmentor which we call 3D MaskRCNN, to predict 3D object bounding boxes and per object voxel occupancies.


At each timestep, we feed the input RGB image to a 2D convolutional encoder-decoder network with skip-connections (2D U-net [21]) to obtain a set of 2D feature maps. We then unproject all feature maps to create a 4D feature tensor as follows: for each "cell" in the 3D feature grid, with center at (X, Y, Z) in the current camera's coordinate frame, we compute the 2D pixel location (u, v) which the center of the cell projects onto, from the current camera viewpoint:

(u, v) = (f X / Z, f Y / Z),

where f is the focal length of the camera. The cell is then filled with the bilinearly interpolated 2D feature vector at that pixel location. All voxels lying along the same ray cast from the camera center are filled with nearly the same image feature vector. We further unproject the input 2D depth map into a binary voxel occupancy grid that contains the thin shell of voxels directly visible from the current camera view. We compute this by filling all voxels whose unprojected depth value equals the grid depth value. When a depth sensor is not available, we learn to estimate the depth map using a 2D U-net that takes the RGB image as input.

We multiply the feature tensor with the binary occupancy grid to get a final 4D feature tensor. The unprojected tensor enters a 3D encoder-decoder network with skip connections (3D U-net) to produce a resulting feature tensor.

In the case of view prediction, we did not use depth as input, nor a 2D U-net to estimate it; we use only the RGB input, for a fair comparison with prior art.
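As a rough illustration, the unprojection step can be sketched in NumPy as below. The helper name `unproject`, the flat `(N, 3)` list of cell centers, and the explicit principal point `(cx, cy)` are our own simplifying assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def unproject(feat2d, grid_pts, f, cx, cy):
    """Fill each 3D grid cell with the bilinearly interpolated 2D feature
    at the pixel its center projects onto (pinhole camera model).

    feat2d:    (H, W, C) 2D feature map
    grid_pts:  (N, 3) cell centers (X, Y, Z) in camera coordinates, Z > 0
    f, cx, cy: focal length and principal point (assumed intrinsics)
    Returns:   (N, C) features; zero for cells projecting outside the image.
    """
    H, W, C = feat2d.shape
    X, Y, Z = grid_pts[:, 0], grid_pts[:, 1], grid_pts[:, 2]
    u = f * X / Z + cx          # pixel column
    v = f * Y / Z + cy          # pixel row
    out = np.zeros((len(grid_pts), C))
    # bilinear interpolation at (u, v)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    valid = (u0 >= 0) & (u0 + 1 < W) & (v0 >= 0) & (v0 + 1 < H)
    u0v, v0v = u0[valid], v0[valid]
    du, dv = (u - u0)[valid, None], (v - v0)[valid, None]
    out[valid] = (feat2d[v0v, u0v] * (1 - du) * (1 - dv)
                  + feat2d[v0v, u0v + 1] * du * (1 - dv)
                  + feat2d[v0v + 1, u0v] * (1 - du) * dv
                  + feat2d[v0v + 1, u0v + 1] * du * dv)
    return out
```

Cells whose projection falls outside the image are left at zero, mirroring the fact that only rays that hit the image plane receive features.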

Egomotion estimation and stabilization

Our model predicts the absolute elevation angle of the first camera view using a 2D convnet, and orients the 3D feature memory to have zero elevation; this essentially keeps the memory parallel to the ground plane. The azimuth of the 3D feature memory is chosen to be the azimuth of the first view in the input frame sequence. At each time step, we use a 3D cross-convolution operator to predict the relative elevation and azimuth between the current frame (viewpoint) and the feature memory map. Note that we could alternatively predict the absolute elevation directly from each input view, without cross-correlating with the memory built thus far. For the azimuth, since we need to estimate the relative azimuth to the first view, such cross-view comparison is necessary.

Our 3D cross-correlation network adapts the method of Henriques et al. [12] to operate on 3D maps, as opposed to 2D (overhead) maps. Specifically, each feature tensor V is rotated by the different candidate azimuth and elevation angles, resulting in a stack of rotated feature tensors V_1, ..., V_R, where R is the total number of discretized (azimuth, elevation) pairs considered. Similar to the bilinear interpolation used during unprojection, to fill in each feature voxel of a rotated tensor V_r we compute the 3D location it is rotated from and insert the bilinearly interpolated feature value from the original tensor V. We then compare each of the rotated feature tensors with our current 3D feature memory M using matrix inner products, to produce a probability distribution over (azimuth, elevation) pairs:

p_r = exp(<M, V_r>) / sum_{r'} exp(<M, V_{r'}>),   r = 1, ..., R,

where <., .> denotes the matrix inner product. The estimated rotation is obtained as the weighted average of the candidate azimuth and elevation angles, with weights p. Finally, we orient the tensor to cancel the estimated relative rotation with respect to our 3D memory map M; we denote the oriented tensor by V-bar.
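A minimal sketch of this soft rotation estimate, assuming a flat list of candidate angles and omitting the separate azimuth/elevation factorization:

```python
import numpy as np

def estimate_rotation(memory, rotated_stack, angles):
    """Soft egomotion estimate: compare each rotated feature tensor with the
    memory via a matrix inner product, softmax the scores into a distribution,
    and return the probability-weighted average of the candidate angles.

    memory:        (D, H, W, C) 3D feature memory
    rotated_stack: (R, D, H, W, C) input tensor rotated by R candidate angles
    angles:        (R,) candidate angles (e.g. azimuths, in degrees)
    """
    logits = np.tensordot(rotated_stack.reshape(len(angles), -1),
                          memory.ravel(), axes=1)   # <M, V_r> per candidate
    p = np.exp(logits - logits.max())
    p /= p.sum()                                    # softmax over candidates
    return float(np.dot(p, angles))                 # expected angle
```

The softmax makes the whole estimate differentiable, so the egomotion module can be trained end-to-end with the rest of the network.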

Recurrent map update

Once the feature tensor has been properly oriented, we feed it as input to a 3D convolutional Gated Recurrent Unit [4] layer, whose hidden state is the memory map, as shown in Figure 3. The hidden state is initialized to zero at the beginning of the frame sequence. For our view prediction experiments (Section 4) we found that, when the number of views is fixed, average pooling of the oriented tensors works equally well as the GRU update equations, while being much faster.
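The gating logic of the 3D convolutional GRU update can be sketched as follows; for brevity we use 1x1x1 convolutions (a per-voxel channel-mixing matmul) where a real implementation would use wider 3D kernels, and we omit bias terms:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv_gru_3d_step(h, x, Wz, Wr, Wh):
    """One update of a 3D convolutional GRU, sketched with 1x1x1 convolutions;
    the gating logic is the standard GRU recurrence applied voxel-wise.

    h, x:       (D, H, W, C) hidden memory map and oriented input tensor
    Wz, Wr, Wh: (2C, C) weights applied to the channel-concatenated input
    """
    def conv1(inp, W):   # 1x1x1 convolution == channel-wise matmul
        return np.einsum('dhwc,co->dhwo', inp, W)
    hx = np.concatenate([h, x], axis=-1)
    z = sigmoid(conv1(hx, Wz))                   # update gate
    r = sigmoid(conv1(hx, Wr))                   # reset gate
    h_tilde = np.tanh(conv1(np.concatenate([r * h, x], axis=-1), Wh))
    return (1 - z) * h + z * h_tilde             # new memory map
```

Because the input tensor is already oriented to the memory's coordinate frame, the same kernel weights see corresponding scene content at every time step.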

Projection and decoding

Given a 3D feature state and a desired viewpoint, we first rotate the 3D feature map so that its depth axis is aligned with the query camera axis. We then generate, for each depth value, a corresponding projected feature map. Specifically, for each depth value, the projected feature vector at a pixel location is computed by first obtaining the 3D location it is projected from, and then inserting the bilinearly interpolated value from the corresponding slice of the 4D tensor. In this way, we obtain one projected map per depth value. Our depth values are equally spaced and span a range centered at the distance from the camera to the center of the feature map.

Note that we do not attempt to determine visibility of features at this projection stage. The stack of projected maps is processed by 2D convolutional operations and is decoded using a residual convLSTM decoder, similar to the one proposed in [6], to an RGB image. We do not supervise visibility directly. The network implicitly learns to determine visibility and choose appropriate depth slices from the stack of projected feature maps.
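A simplified sketch of the depth-sliced projection, using nearest-neighbor rather than bilinear sampling and a hypothetical `voxel_to_idx` mapping from camera coordinates to grid indices:

```python
import numpy as np

def project_slices(feat3d, depths, f, cx, cy, out_hw, voxel_to_idx):
    """Project a 3D feature map into one 2D map per depth value, by
    back-projecting each pixel at each depth to a 3D point and sampling the
    nearest voxel (nearest-neighbor for brevity; the paper's version uses
    bilinear interpolation).

    feat3d:       (D, H, W, C) 3D feature tensor, oriented to the query view
    depths:       list of depth values to slice at
    f, cx, cy:    assumed pinhole intrinsics
    out_hw:       (h, w) of each projected map
    voxel_to_idx: maps a 3D point (x, y, z) to fractional grid indices
    """
    D, H, W, C = feat3d.shape
    h, w = out_hw
    maps = np.zeros((len(depths), h, w, C))
    for n, d in enumerate(depths):
        for v in range(h):
            for u in range(w):
                x, y, z = d * (u - cx) / f, d * (v - cy) / f, d
                i, j, k = (int(round(t)) for t in voxel_to_idx((x, y, z)))
                if 0 <= i < D and 0 <= j < H and 0 <= k < W:
                    maps[n, v, u] = feat3d[i, j, k]
    return maps
```

The stack of per-depth maps is what the subsequent 2D convolutions and the convLSTM decoder consume; as the text notes, visibility reasoning is left to those learned layers.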

We train GRNNs for view prediction in a self-supervised manner, and for 3D object detection in a supervised manner.

3.1 View prediction

Mobile agents have access to their egomotion, and can observe the sensory outcomes of their motions and interactions. Training sensory representations to predict such outcomes is a useful form of supervision: it is free of human annotations and is often termed self-supervision, since the "labels" are provided by the embodied agent herself. Can spatial common sense, the notion of objects and scenes, geometry, visibility and occlusion relationships, emerge in a self-supervised way in a mobile agent that moves around and observes the world? How do we evaluate such spatial common sense acquisition? We show below and in Section 4 that training GRNNs for view prediction results in high-precision predictions that greatly outperform alternative geometry-unaware RNN architectures and strongly generalize beyond the training set, to novel scenes with a different number, appearance and arrangement of objects. Our agent can accurately predict how a novel scene looks from a different viewpoint, and the predictions show the agent has learnt that objects have 3D extent, that objects occlude each other when one is placed closer to the camera than the other, that object shape and appearance have spatial regularities, and so on.

We train GRNNs to predict the image the agent would see from a novel viewpoint, given a short view sequence as input. Given the 3D feature memory map and a query viewpoint, we orient the map to the query viewpoint, project it to 2D as described above, and decode it to a pixel image via a residual convLSTM decoder, also used in [6]. We train for view prediction using a standard cross-entropy pixel-matching loss, where the pixel intensities have been squashed into the range [0, 1]. Our model is end-to-end differentiable. Training and implementation details are included in the supplementary file.

3.2 3D object detection and segmentation

We train GRNNs in a supervised manner to predict 3D object bounding boxes and segmentation masks, using groundtruth 3D object boxes and 3D voxel segmentations from a simulator. We adapt MaskRCNN [11], a state-of-the-art object detector/segmentor, to operate with 3D input and output instead of 2D. Specifically, we consider every grid location in our 3D map to be a candidate 3D box centroid. At each time step, the 3D feature memory map is fed to a 3D region proposal network that predicts positive anchor centroids, as well as the corresponding adjustments to the box center location and to the box dimensions (width, height and depth). Our 3D bounding box encoding is similar to the one proposed in VoxelNet [31]. We filter the proposed boxes using non-max suppression to reject highly overlapping ones. We train with a combination of classification and regression losses, following well-established detector training schemes [20, 11]. Proposed 3D bounding boxes whose Intersection over Union (IoU) with a corresponding groundtruth object box is above a specific threshold are denoted Regions of Interest (ROIs), and are used to pool segmentation features from their interior to predict 3D object segmentation, i.e., 3D voxel occupancy, as well as a second refinement of the predicted 3D box location and dimensions.
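The box filtering step can be illustrated with axis-aligned 3D IoU and greedy non-max suppression; the `(cx, cy, cz, w, h, d)` box encoding below is our own simplification of the VoxelNet-style encoding mentioned above:

```python
import numpy as np

def iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes given as (cx, cy, cz, w, h, d)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    lo = np.maximum(a[:3] - a[3:] / 2, b[:3] - b[3:] / 2)   # overlap min corner
    hi = np.minimum(a[:3] + a[3:] / 2, b[:3] + b[3:] / 2)   # overlap max corner
    inter = np.prod(np.clip(hi - lo, 0, None))
    union = np.prod(a[3:]) + np.prod(b[3:]) - inter
    return inter / union

def nms_3d(boxes, scores, iou_thresh=0.5):
    """Greedy non-max suppression: keep boxes in decreasing score order,
    rejecting any proposal overlapping a kept box above iou_thresh."""
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if all(iou_3d(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```

The same IoU routine, applied against groundtruth boxes, decides which proposals become ROIs for the segmentation head.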

Object permanence

Even when an object is not visible from the current camera viewpoint, its features are present in the 3D feature memory, and our detector detects and segments it, as we show in Figure 6. In other words, object detections persist despite occlusions caused by camera motion, or an object lying outside the field of view. Applying the detector to the latent 3D model of the scene, as opposed to the 2D visual observation space, is beneficial: the latent model follows the physical laws of 3D non-intersection and object permanence, while 2D visual observations do not.

4 Experiments

The term “spatial common sense” is broad and concerns the ability to perceive and understand properties and regularities regarding spatial arrangements and motion that are shared by (“common to”) nearly all people. Such common sense includes the fact that objects have 3D shape as opposed to being floating 2D surfaces, the fact that scenes are comprised of objects, the 3D non-intersection principle, the fact that objects do not spontaneously disappear, and many others [7]. The model we propose in this work targets understanding of static scenes, that is, scenes that do not contain any independently moving objects, and that are viewed under a potentially moving observer. Thus, we restrict the term spatial common sense to refer to rules and regularities that can be perceived in static worlds. Our experiments aim to answer the following questions:

  1. Do GRNNs learn spatial common sense?

  2. Are geometric structural biases necessary for spatial common sense to emerge?

  3. How well do GRNNs perform on egomotion estimation and 3D object detection?

To answer those questions, we train and test our model, along with baselines from recent previous works, on the tasks of view prediction and 3D object detection.

Figure 4: View prediction results for the proposed GRNNs and the tower model of Eslami et al. [6]. Columns from left to right show the three input views, the ground-truth image from the query viewpoint, the view predictions for GRNNs and for the tower baseline. The first two rows are from the ShapeNet arrangement test set of [3], and the next two rows are from the Shepard-Metzler test set of [6]. The last four rows show generalization to scenes with four objects from the ShapeNet arrangement dataset, while both models were trained only on scenes with two objects. GRNNs outperform the baseline by a large margin and strongly generalize under a varying number of objects.
Figure 5: Scene arithmetic from the proposed GRNNs and the model of Eslami et al. [6] (tower). Each row is a separate "equation". We start with the representation of the scene in the leftmost column, then subtract (the representation of) the scene in the second column, and add (the representation of) the scene in the third column. We decode the resulting representation into an image view. The groundtruth image, shown in the fourth column, is much more visually similar to the prediction of GRNNs than to that of the tower baseline.

4.1 View prediction

We consider the following simulation datasets: i) ShapeNet arrangement from [3], which consists of scenes with synthetic 3D object models from ShapeNet [2] arranged on a table surface, made available to us by the authors of [3]. The objects in this dataset belong to four object categories, namely cups, bowls, helmets and cameras. We follow the same train/test split of ShapeNet [2], so that object instances which appear in the training scenes do not appear in the test scenes. Each scene contains two objects, and each image is rendered from a viewing sphere with 54 possible views (3 camera elevations times 18 azimuths). There are 300 different scenes in the training set and 32 scenes with novel objects in the test set. ii) The Shepard-Metzler shapes dataset from [6], which contains scenes consisting of seven colored cubes stuck together in random arrangements. We use the train and test split of [6]. iii) The rooms-ring-camera dataset from [6], a random rooms environment with random floor and wall colors and textures, and a variable number of shapes of different geometries and colors in each room. Results on this dataset are shown in the supplementary material to save space.

We compare the proposed GRNNs against the recent "tower" architecture of Eslami et al. [6], a 2D network trained under a similar view prediction loss that has a 2D instead of a 3D feature space, and no egomotion-stabilized convolutions. We use our own implementation, since no implementation was provided by the authors. The tower architecture takes as input each 2D image and performs a series of convolutions on it. The camera pose from which the image was taken is tiled along the width and height axes and concatenated into the feature map after the third convolution. Finally, the feature maps from all views are combined via average pooling. Both our model and the baseline use the same autoregressive decoder network. For fairness of comparison, we use groundtruth egomotion rather than estimated egomotion in all view prediction experiments, and only RGB input (no depth input or depth estimation) for both our model and the tower baseline. In both the baseline and our model, we did not use any stochastic units, for simplicity and speed of training. Adding stochasticity is part of our future work.

Test results from our model and the baseline on test images from the ShapeNet arrangement and Shepard-Metzler datasets are shown in Figure 4. Reconstruction test error for the ShapeNet arrangement dataset is shown in Table 1. GRNNs have a much lower reconstruction test error than the tower baseline. In Figure 4, in the first four rows, the distribution of the test scenes matches the training scene distribution; our model outperforms the baseline in visual fidelity. In the last four rows, the test scene distribution does not match the training one: we test our model and the baseline on scenes with four objects, while both models were trained on scenes with exactly two objects. In this case, our model shows strong generalization and outperforms the geometry-unaware baseline of [6] by a large margin. Indeed, the ability to reason spatially should not be affected by the number of objects present in the scene. These results suggest that geometry-unaware models may be merely memorizing views, with small interpolation capabilities, as opposed to learning to spatially reason. We attribute this to their inability to represent space efficiently in their latent vectors, a problem the proposed architectures correct for.

Scene arithmetics

The learnt representations of GRNNs under view prediction are capable of scene arithmetic, as we show in Figure 5. The ability to add and subtract individual objects from 3D scenes just by adding and subtracting their corresponding latent representations demonstrates that our model disentangles what from where. In other words, our model learns to store object-specific information in the regions of the memory which correspond to the spatial location of the corresponding object in the scene. Therefore, it is relatively straightforward to carry out scene arithmetic with our model. Implementation details and more qualitative view prediction results are included in the supplementary material.

Table 1: View prediction loss and its standard deviation on the ShapeNet arrangement test set for two-object test scenes, for the tower baseline and GRNNs. Our model and the baseline were trained on scenes that also contain two objects, with different object instances.
Figure 6: 3D object detection and segmentation with GRNNs. Blue voxels denote groundtruth objects. Predicted 3D boxes and their corresponding predicted masks are shown in red and green to aid visualization. Best seen in color.

4.2 Egomotion estimation

In this section, we quantify the error of our egomotion estimation component. We train our egomotion estimation module using groundtruth egomotion from a simulator, using the ShapeNet arrangement dataset. In Table 2, we show egomotion estimation error in elevation and azimuth angles. Our model improves its egomotion estimates with more views, since then a more complete feature map is compared against each input unprojected tensor.

Table 2: Egomotion estimation error of GRNNs in elevation and azimuth angles on the ShapeNet arrangement test set, using one, two and three input views (and their average). The error decreases as more views are integrated into the memory.

4.3 3D object detection and segmentation

We again use the ShapeNet arrangement dataset and the train/test scene split of [3]. We use mean Average Precision (mAP) to score the performance of our model and baselines on 3D object detection and 3D segmentation; mean average precision measures the area under the precision-recall curve. To score our model in both a loose and a tight regime, we vary the cutoff threshold on the Intersection over Union (IoU) between our predictions and the groundtruth 3D boxes and masks over 0.33, 0.5 and 0.75. We consider four ablations of our model: predicted egomotion (pego) versus groundtruth egomotion (gtego), and predicted depth (pd) versus groundtruth depth (gtd) used as input. We use suffixes to indicate the model variant.

Table 3: Mean Average Precision (mAP) for 3D object detection and 3D segmentation at three Intersection over Union (IoU) thresholds (0.75, 0.5, 0.33) on the ShapeNet arrangement [3] test set, for the models 2DRNN-gtego-gtd, GRNN-gtego-pd, GRNN-gtego-gtd and GRNN-pego-gtd.
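For reference, the area under the precision-recall curve can be computed from ranked detections as sketched below (a plain rectangle-rule version; benchmark implementations often additionally interpolate precision):

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """Area under the precision-recall curve for one class: sort detections
    by confidence, accumulate true positives, and sum precision times each
    recall increment."""
    order = np.argsort(scores)[::-1]
    tp = np.asarray(is_true_positive, float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)   # precision after each detection
    recall = cum_tp / num_gt                        # recall after each detection
    steps = np.diff(np.concatenate([[0.0], recall]))
    return float(np.sum(precision * steps))         # rectangle rule
```

Whether a detection counts as a true positive is decided by thresholding its IoU with the best-matching groundtruth box or mask, which is exactly the 0.33/0.5/0.75 knob varied in Table 3.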

We compare against the following 2D baseline, which we call 2D-RNN: we remove the unprojection, egomotion estimation and stabilization, and projection operations from our model. The baseline takes as input an image and the corresponding depth map and feeds them to a 2D encoder-decoder network with skip connections to obtain a 2D feature tensor. The camera parameters for the view are concatenated as additional channels to this tensor, and the result is fed to another 2D encoder-decoder network whose output drives a 2D GRU memory update. We then feed the 2D memory feature tensor to an additional 2D encoder-decoder network and reshape the channel dimension of its output into a depth axis and a feature axis, forming a 4D tensor as the prediction.
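The baseline's final reshape, from a 2D feature map to a 4D tensor, can be sketched as follows (the function name `reshape_to_4d` and the axis ordering are our assumptions; the point is only that the depth axis is carved out of the channel dimension rather than built by geometric unprojection):

```python
import numpy as np

def reshape_to_4d(feat2d, depth_bins):
    """Fold the channel dimension of a 2D feature map (H, W, C) into a
    depth axis, producing a 4D tensor (H, W, depth_bins, C // depth_bins),
    as in the 2D-RNN baseline's final prediction step."""
    h, w, c = feat2d.shape
    assert c % depth_bins == 0, "channels must divide evenly into depth bins"
    return feat2d.reshape(h, w, depth_bins, c // depth_bins)
```

Because the depth axis here is purely a reshape of learned channels, nothing forces it to align with scene geometry, which is the key difference from the GRNN latent space.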

We report mean average precision for 3D object detection and 3D segmentation for our model and the baseline in Table 3, and visualize predicted 3D bounding boxes and segmentations from GRNNs (GRNN-gtego-gtd) in Figure 6. GRNNs significantly outperform the 2D-RNN. Groundtruth depth input significantly helps 3D segmentation. This suggests that inferring depth using a cost volume, as in [14], would potentially improve depth inference, as opposed to relying on a per-frame depth network [5] that has no access to multiple views to refine its predictions. Please see the supplementary material for implementation details and more qualitative results.

5 Conclusion/Discussion

We presented GRNNs, recurrent neural networks equipped with differentiable geometric operations that estimate egomotion and build 3D deep feature maps for visual scene understanding on mobile visual agents. Our models add a new dimension to the latent space of previous recurrent models and ensure a geometrically-consistent mapping between the latent state and the 3D world scene. We showed that spatial common sense emerges in GRNNs trained in a self-supervised manner for novel view prediction: our model can predict object arrangements, visibility, and occlusion relationships in scenes with novel numbers, appearances, and configurations of objects. We also showed that a view prediction loss alone does not suffice for spatial common sense to emerge, since the 2-dimensional models of previous works fail to strongly generalize.

Thus far, our model has been trained and tested on simulated scenes. Deploying it on real mobile agents, or on static robots with wrist-mounted cameras, and enabling them to spatially reason and detect objects in the real world is a clear avenue for future work. We expect pretraining in simulated environments to help performance in the real world. Another limitation of the current model is that it operates only on static scenes, i.e., scenes without moving objects; extending the proposed architectures to dynamic scenes is another very useful direction. Finally, exploiting the sparsity of our 4D tensors to save GPU memory, potentially together with learned memory allocation policies, is an important direction for scaling up our model to large scenes.

GRNNs pave the way for embodied agents that learn visual representations and mental models by observing and moving in the world: these agents learn autonomously and develop the reasoning capabilities of young toddlers as opposed to merely mapping pixels to labels using human supervision.


Appendix A GRNNs implementation details

The input images, output images, and predictions (for view prediction) all have size 64×64. Our pre-unprojection 2D encoder-decoder network has encoder layers with 32, 64, 128, and 256 channels, respectively; the decoder layers are symmetric to the encoder layers. The spatial sizes of these feature maps are 32×32, 16×16, 8×8, and 4×4, respectively, since each convolution has stride 2. For depth prediction, which is used for object detection, we use the same 2D encoder-decoder network as in this pre-unprojection step.
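The spatial sizes above follow directly from stride-2 downsampling, which the small helper below computes (the helper name and the 'same'-padding assumption are ours; the paper does not state its padding scheme):

```python
def encoder_sizes(input_size, num_layers, stride=2):
    """Spatial size after each stride-s convolution, assuming 'same'
    padding so each layer divides the size by the stride (rounding up).
    E.g. 64 -> [32, 16, 8, 4] for four stride-2 encoder layers."""
    sizes, s = [], input_size
    for _ in range(num_layers):
        s = (s + stride - 1) // stride  # ceiling division
        sizes.append(s)
    return sizes
```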

View prediction

We feed our feature tensor through a 3D encoder-decoder with skip connections, where the encoder has 64, 128, 256, 512, and 1024 channels, respectively; the decoder is symmetric to the encoder. In our convLSTM decoder, we do not share weights across time steps; allowing each step its own weights was noted by [6] to improve performance. We use six generation steps, as we saw no noticeable improvement in image reconstruction quality with more, and 256 channels at each step. Both our model and the baseline are deterministic networks; we did not include stochastic units in either, to accelerate training. We trained each model for 24 hours, using a batch size of 8 and a learning rate of . This resulted in roughly steps of backpropagation for the tower (baseline) architecture and steps for the GRNN architecture. We used the Adam optimizer.
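The structure of an iterative decoder with unshared per-step weights can be sketched abstractly as follows (a toy dense-layer stand-in for the actual convLSTM; the function name `generate` and the additive-canvas update are our assumptions):

```python
import numpy as np

def generate(canvas, hidden, step_weights):
    """Iterative generation with unshared weights: each step has its own
    weight matrix (not shared across time), updates the hidden state, and
    adds a residual to the canvas, mirroring a multi-step convLSTM decoder."""
    for W in step_weights:  # one independent weight matrix per generation step
        hidden = np.tanh(hidden @ W)
        canvas = canvas + hidden
    return canvas, hidden
```

With six entries in `step_weights`, this corresponds to the six generation steps used above; sharing a single `W` across iterations would recover the weight-tied variant.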

Object detection/segmentation

For object detection and egomotion prediction, we feed the unprojected feature tensor through a 3D encoder-decoder with skip connections, where the encoder has 16, 32, 64, and 128 channels, respectively; the decoder is again symmetric to the encoder. For egomotion prediction, when we compare the current memory with the rotated feature tensors generated from the new view, we use outputs from all layers of the decoder (feature tensors of size 4×4×4×128, 8×8×8×64, 16×16×16×32, and 32×32×32×16) to compute cross-convolutions. For object detection, we take only the last feature tensor from the 3D encoder-decoder and pass it to another 3D encoder-decoder with skip connections to predict positive anchor centroids and their corresponding adjustments for the box centers. The channels of this second 3D encoder-decoder are set to 16, 32, 64, and 128, and its final output is a 32×32×32×16 feature tensor. We then pass this tensor to one more 3D convolutional layer with kernel size 3, stride 1, and 7 output channels. The final prediction is a 32×32×32×7 tensor whose first channel indicates positive anchor centroids and whose last 6 channels indicate 3D box adjustments at each centroid. The model is trained with the Adam optimizer (learning rate set to ) without further parameter tuning. We train for roughly 25K iterations with batch size 2. We first train for egomotion prediction alone, and then jointly train with the egomotion prediction, object detection, and segmentation losses.
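Decoding such a prediction tensor into detections can be sketched as below (the function name `decode_detections` and the score threshold are our assumptions; the paper does not specify its post-processing):

```python
import numpy as np

def decode_detections(pred, score_thresh=0.5):
    """Decode a (D, D, D, 7) prediction tensor: channel 0 scores positive
    anchor centroids; channels 1:7 hold the per-centroid 3D box adjustments.
    Returns a list of (centroid_index, adjustments) for centroids whose
    score clears the threshold."""
    scores = pred[..., 0]
    adjustments = pred[..., 1:]
    idx = np.argwhere(scores > score_thresh)  # voxel coordinates of positives
    return [(tuple(i), adjustments[tuple(i)]) for i in idx]
```

Each surviving centroid, combined with its 6-dimensional adjustment (e.g., center offsets and box sizes), yields one 3D box proposal in the memory grid.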

Appendix B Additional results

In Figures 7 and 8, we show view prediction results on the rooms_ring_camera and Shepard-Metzler datasets introduced in [6]. Since the camera intrinsics were not given, we used an estimated vertical and horizontal field of view of 60 degrees. Our model outperforms the baseline by a clear margin. In Figures 9 and 10, we show more view prediction results on the ShapeNet arrangement dataset of [3]. In Figure 11, we show qualitative 3D object detection and segmentation results.

Input V1,V2,V3 query gt GRNNs Tower
Figure 7: View prediction results for the room scenes from  [6]
Input V1,V2,V3 query gt GRNNs Tower
Figure 8: View prediction results for the 7-segment Shepard-Metzler dataset from [6]
Input V1,V2,V3 query gt GRNNs Tower
Figure 9: View prediction results for ShapeNet arrangement test scenes from  [3]
Input V1,V2,V3 query gt GRNNs Tower
Figure 10: View prediction results for 4-object scenes from  [3]
Figure 11: Object detection and segmentation results. Blue and light blue grids in the last three columns show groundtruth voxel occupancy for the two objects present in the scene. 3D bounding boxes in different colors (red, green, and magenta) are predicted by the proposed 3D Mask R-CNN.