People possess an incredible ability to infer contextual information from a single image. Whether it is by using prior experience or by leveraging visual cues [3, 25], people are adept at reasoning about what may lie beyond the field of view and making use of that information to build a coherent perception of the world. Similarly, in robotics and computer vision, extrapolating useful information outside a camera’s field of view (FOV) plays an important role in many applications, such as goal-driven navigation [42, 5] or next-best-view approximation, where a global representation of the environment can improve preemptive planning for intelligent systems.
However, prior work in view extrapolation typically only predicts the color pixels beyond the image boundaries [26, 41, 32]. While inspiring, these methods do not predict 3D structure or semantics, and hence cannot be used directly for high-level reasoning tasks in robotics applications.
In this paper, we explore the task of directly extrapolating 3D structure and semantics for a full panoramic view of a scene when given a view covering 50% or less as input. We refer to this task as semantic-structure view extrapolation. Our method, Im2Pano3D, takes in a partial view of an indoor scene (e.g., a few RGB-D images) and uses a convolutional neural network to generate dense predictions for 3D structure and a probability distribution of semantic labels for a full panoramic view of that same scene.
This is a very challenging task. However, by learning the statistics of many typical room layouts, we can train a data-driven model to leverage contextual cues to predict what is beyond the field of view for typical indoor environments. For example, as shown in Fig.1, given half of a bedroom (180° horizontal field of view), the system can predict the 3D structure and semantics for the other half. This requires it not only to extend the partially observed room structures (walls, floor, ceiling, etc.), but also to predict the existence and locations of objects that are not directly observed in the input (bed, window and cabinet) using statistical properties learned from data.
Semantic-structure view extrapolation poses three main challenges, which we address with corresponding key ideas shaping our approach to the task:
How to leverage strong contextual priors for indoor environments.
How to represent the 3D structure in a way that is good not only for recognition but also for reconstruction.
How to design meaningful loss functions when the possible solution is not unique – a small change to object locations may still result in a valid solution.
To leverage strong contextual priors for indoor environments, we represent 3D scenes in a single panorama image with channels encoding 3D structure and semantics. We train our model on large-scale synthetic (SUNCG [34]) and real-world (Matterport3D [6]) indoor scenes encoded in this representation to learn the contextual prior.
To leverage strong geometric priors for indoor environments, we represent the 3D structure for each pixel with a 3D plane equation, rather than raw depth value at each pixel. By doing so, we take advantage of the fact that indoor environments are often comprised largely of planar surfaces. Since all pixels on the same planar surface have the same plane equation, the 3D structure is piecewise constant in a typical scene, which makes dense predictions of plane equations more robust than alternative representations.
To provide meaningful supervision for the network to cover the large solution space, we make use of multiple loss functions that account for both pixel level accuracy (pixel-wise reconstruction loss) and global context consistency (adversarial loss, and scene attribute loss).
The primary contribution of our paper is to propose the task of semantic-structure view extrapolation and present Im2Pano3D, a unified framework able to produce a complete room structure and semantic labeling when given a partial observation of a scene. This unified framework is able to handle different camera configurations and input modalities. The experimental results show that direct prediction of the 3D structure and semantics for the unobserved scene provides a more accurate result than alternative methods. Both the plane equation encoding and the context model learned from multi-level supervision with large scale indoor scenes help to improve prediction quality.
2 Related Work
The general scene understanding problem focuses on understanding what is present in an image, including scene classification [22, 40], semantic segmentation, depth and normal estimation [10, 38], etc. In this section, we review prior work on these tasks beyond the visible scene.
Texture synthesis and image inpainting.
Learning-based approaches such as context encoders [26] train an autoencoder network. These methods can achieve very impressive inpainting results for holes in color images. However, it is challenging for them to predict image content far outside the field of view, since they do not explicitly model structure and semantics.
Stitching images from the Internet.
Methods have also been proposed to extrapolate images drastically beyond the field of view using collections of Internet images. For example, Shan et al. [32] produce “uncropped images” by stitching together collections of images captured in the same scene. Hays and Efros [13] fill large holes by copying content from similar images in a large collection. While these methods produce impressive results, they only work for scenes where collections are available with many images from nearby viewpoints.
User-guided view extrapolation.
FrameBreak [41] performs dramatic view extrapolation. However, it uses a “guide image” provided by a person to constrain the image synthesis process. The guide image is chosen from a collection of panorama images, aligned with the input image, and then used to guide a patch-based texture synthesis algorithm. In this work, we aim to produce an image extrapolation framework that can be used for any common indoor environment without human intervention.
Predicting 3D structure in occluded regions.
Recently, many works have addressed the problem of shape completion for individual objects [36, 27, 39] or scenes [37, 34, 9]. Given a partial observation of an object or scene, the task is to complete the shape of the object or scene in the occluded regions within the field of view. Unlike these methods, Im2Pano3D needs to predict the 3D structure outside the field of view, where there is no direct observation, which makes the problem much harder.
Predicting semantic concepts beyond the visible scene.
Khosla et al. [20] propose a framework to predict the locations of semantic concepts outside the visible scene, e.g., answering questions like “where can I find a restaurant” given a street-view image without direct sight of any restaurant. Although related, their work focuses on outdoor street-view scenes and provides only high-level sparse semantic predictions. In contrast, we produce dense pixel-wise predictions of both 3D structure and semantics for pixels outside the observed view in indoor scenes.
3 Semantic-Structure View Extrapolation
We formulate the semantic-structure view extrapolation problem as an image inpainting task by representing both the input observation and output prediction as multi-channel panoramic images. The goal of Im2Pano3D is to predict the 3D structure and semantics for all missing regions in the input panorama. For the semantic prediction, instead of representing it as a discrete category, we model it as a probability distribution over all semantic categories as shown in Fig.2, which explicitly models the prediction uncertainty.
3.1 Whole Room Panoramic Representation
Traditional view synthesis works [31, 30] represent observations and new views using a set of disjoint images with their camera parameters. However, this requires the network to handle arbitrary numbers of input views, infer spatial relationships between them, and reason about how scene elements cross image boundaries.
In contrast, we propose to represent the 3D scene using a single panorama where each pixel is labeled with multiple channels of information (color, 3D structure, and semantics) or marked as unobserved. This data representation allows the network to learn a consistent whole-room context model by describing both the observed and unobserved parts of the entire scene from a single viewpoint. It is particularly efficient for deep learning because the observations and predictions are resampled in a regular 2D parameterization suitable for convolution. Meanwhile, it can naturally support different input camera configurations through reprojection (see Fig.10).
Given an observation of a 3D scene reconstructed from registered RGB-D images, we pick a virtual camera center and render the mesh onto four perspective image planes in a skybox-like fashion (see Fig.3). Each image plane has a fixed horizontal FoV, vertical FoV, and image size. Virtual camera centers are chosen depending on the dataset: for the Matterport3D dataset, we use tripod locations; for the SUNCG dataset, we randomly select locations in empty space; for short RGB-D videos, we use the median of all camera centers.
3.2 Representing 3D Surfaces with Plane Equations
While deep networks have been shown to perform well for predicting color pixels and semantic labels, they continue to struggle at predicting high-quality 3D structure. Current methods that directly regress raw depth values produce blurred results [35, 21, 8], partly due to the viewpoint-dependent nature of depth maps and the large variance of depth values even for nearby pixels on the same 3D plane. Surface normal predictions are generally of higher quality; however, solving for depth from normals is under-constrained and sensitive to noise. Other more complicated encodings, such as HHA [12], are designed for recognition, but cannot be used directly to recover the 3D structure.
In response to these issues, we propose to represent 3D surfaces with their plane equations: surface normal and plane distance to the virtual camera origin. We expect this representation to be easier to predict in indoor environments composed of large planar surfaces because all pixels on the same planar surface share the same plane equation – i.e., the representation is mostly piecewise constant. Moreover, the 3D location of each pixel can be solved trivially from its plane equation by intersection with a camera ray.
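To make the piecewise-constant argument concrete, here is a minimal sketch (an illustration, not the paper's code). It assumes the plane is written as n · X = p with the camera at the origin; the function names and the example wall are hypothetical. Three rays hit the same wall: the distance from the camera to the surface changes from pixel to pixel, while the plane parameters (n, p) stay identical.

```python
import math

def intersect_ray_with_plane(normal, distance, ray):
    """Return the 3D point where a ray from the origin meets the plane n . X = distance."""
    denom = sum(n * r for n, r in zip(normal, ray))
    if abs(denom) < 1e-9:
        raise ValueError("ray is parallel to the plane")
    scale = distance / denom
    return tuple(scale * r for r in ray)

def ray_for_angle(theta_deg):
    """Unit ray in the horizontal plane, theta degrees away from the +z viewing axis."""
    theta = math.radians(theta_deg)
    return (math.sin(theta), 0.0, math.cos(theta))

# Hypothetical wall 2 m in front of the camera (the plane z = 2).
wall_normal, wall_distance = (0.0, 0.0, 1.0), 2.0

# The same (normal, distance) pair is used for every ray: the encoding is constant.
points = [intersect_ray_with_plane(wall_normal, wall_distance, ray_for_angle(a))
          for a in (0, 30, 60)]

# The distance from the camera to the surface, by contrast, varies along the wall.
depths = [math.sqrt(sum(c * c for c in p)) for p in points]
```

In this example the camera-to-surface distance grows from 2 m to roughly 4 m across the wall, so a network regressing it must output a different value at every pixel, whereas the plane encoding asks for one constant value per surface.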
Our network is trained to optimize the predicted plane equations. We find this representation of 3D structure to be more effective than raw depth values. Fig.4 shows the qualitative comparison. We also have a post-processing step to further improve visual quality of the predicted geometry using plane-fitting on the predicted parameters (this step is not included in our quantitative evaluations).
3.3 Network Architecture
Our network architecture follows an encoder-decoder structure (Fig.5), where the encoder produces a latent vector from an input panorama with missing regions, and the decoder uses that latent vector to produce an output panorama where the missing regions are filled. In this section, we discuss the key features of our network architecture.
Since our panoramic data representation consists of multiple channels (e.g., color, normal, plane distance to the origin, and probability distribution of semantics), we structure our network to process each channel with disjoint streams before merging into and after splitting from the middle layers. In the encoder, each stream is made up of three convolutional layers. The features produced from each stream are merged together by concatenation across channels and then passed through six joint convolutional layers to produce the latent vector. Mirroring this structure, the decoder passes the latent vector through six joint convolutional layers before splitting into multiple streams. This multi-stream structure gives the network a balance between learning channel-specific parameters within each stream and learning joint information through shared layers.
Reconstructing 3D surfaces with PN-Layer
Although our network architecture predicts the parameters of the plane equation as separate channels (surface normals and plane distances), there is no explicit supervision to enforce the consistency between these two outputs. As a result, we find that with only the individual supervision, the 3D surfaces reconstructed from the predicted parameters tend to be noisy. To address this issue, we designed an additional layer in the network (called the PN-Layer) which takes the normals and plane distances as input, and uses the plane equation to produce a dense map of 3D point locations, one for each pixel, based on its predicted normal, predicted plane distance, and pixel location. This layer is fully differentiable, and therefore an additional regression loss can be added on the predicted 3D point locations in order to enforce the consistency between the surface normal and plane distance predictions.
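The computation performed by such a layer can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Torch7 implementation: it assumes a pinhole camera at the origin with hypothetical intrinsics (fx, fy, cx, cy), the plane convention n · X = p, and a small guard against rays parallel to the plane that is added here for safety.

```python
def pn_layer(normals, distances, fx, fy, cx, cy):
    """Map per-pixel plane parameters to 3D points.

    normals:   H x W nested lists of unit normals (nx, ny, nz)
    distances: H x W nested lists of plane distances p
    Assumes a pinhole camera at the origin and the plane convention n . X = p.
    """
    points = []
    for v, (normal_row, dist_row) in enumerate(zip(normals, distances)):
        row = []
        for u, (n, p) in enumerate(zip(normal_row, dist_row)):
            # Back-project the pixel: ray = K^-1 [u, v, 1]^T for a pinhole K.
            ray = ((u - cx) / fx, (v - cy) / fy, 1.0)
            denom = n[0] * ray[0] + n[1] * ray[1] + n[2] * ray[2]
            # Guard (an addition in this sketch) against rays nearly parallel to the plane.
            if abs(denom) < 1e-6:
                denom = 1e-6 if denom >= 0 else -1e-6
            scale = p / denom
            row.append((scale * ray[0], scale * ray[1], scale * ray[2]))
        points.append(row)
    return points

# Tiny 2x2 example: every pixel predicts the same wall z = 3.
H, W = 2, 2
normals = [[(0.0, 0.0, 1.0)] * W for _ in range(H)]
distances = [[3.0] * W for _ in range(H)]
xyz = pn_layer(normals, distances, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Away from the guard, each output coordinate is a ratio of sums of products of the inputs, which is why a regression loss on the points can send gradients back to both the normal and the plane distance predictions.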
3.4 Network Losses
When predicting the scene content for the unobserved regions, the plausible solution might not be unique. For example, a valid prediction with slight changes to its object locations could still represent a valid solution. To provide supervision that reflects this flexibility, we use multiple losses to capture three levels of information: pixel-wise accuracy, mid-level contextual consistency using a PatchGAN (adversarial) loss [18], and global scene consistency measured by scene category and object distributions. The final loss for each channel is a weighted sum of the three level losses: $L = \lambda_{1} L_{\text{pixel}} + \lambda_{2} L_{\text{adv}} + \lambda_{3} L_{\text{attr}}$, where the $\lambda_{i}$ are per-channel weights.
Pixel-wise reconstruction loss.
As part of network supervision, we backpropagate gradients based on the pixel-level reconstruction loss between the predicted and the ground truth panoramas. The loss differs for each output channel. We use a softmax loss for semantic segmentation, a cosine loss for normals, and a regression loss for plane distance and final 3D point locations.
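For a single pixel, the first two of these losses can be sketched as below. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the cosine loss is assumed to take the common form 1 − cos(angle).

```python
import math

def softmax_cross_entropy(logits, label):
    """Softmax loss for one pixel: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def cosine_loss(pred_normal, gt_normal):
    """Assumed form of the cosine loss: 1 - cos(angle between the two normals)."""
    dot = sum(a * b for a, b in zip(pred_normal, gt_normal))
    norm_p = math.sqrt(sum(a * a for a in pred_normal))
    norm_g = math.sqrt(sum(b * b for b in gt_normal))
    return 1.0 - dot / (norm_p * norm_g)

# Two equally likely classes give a loss of log(2); aligned normals give zero loss.
semantic_loss = softmax_cross_entropy([0.0, 0.0], label=0)
normal_loss = cosine_loss((0.0, 0.0, 1.0), (0.0, 0.0, 2.0))
```

The cosine form depends only on the direction of the normals, not on their length, which suits a quantity that is normalized to unit length anyway.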
Adversarial loss.
Following the recent success of generative adversarial networks, we model supervision for generating high-frequency structures in the output panoramas by using a discriminator network adapted from PatchGAN [18]. Similar to the generator, the discriminator network processes each channel with disjoint streams before merging features into shared layers. For the real semantic examples, we convert them into a probability distribution over classes before feeding them into the discriminator. We adopt the method proposed by Luc et al. [24]: for each pixel, given its ground-truth label, the probability assigned to that label is set from the corresponding prediction from the network, and the probabilities of all other classes are set so that the label probabilities sum to one for each pixel.
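One plausible reading of this label encoding is sketched below. It is an assumption offered for illustration, not necessarily the exact formulation used here: the ground-truth class receives at least a threshold probability, and the remaining classes are rescaled in proportion to the network's prediction. The function name, the threshold `tau`, and its default value are hypothetical.

```python
def soften_ground_truth(pred, label, tau=0.9):
    """Turn a one-hot label into a distribution that stays close to the prediction.

    pred:  predicted class probabilities for one pixel (sums to 1)
    label: index of the ground-truth class
    tau:   hypothetical lower bound on the ground-truth class probability
    """
    y_label = max(tau, pred[label])
    rest = 1.0 - pred[label]  # probability mass the prediction gives to other classes
    target = []
    for c, s in enumerate(pred):
        if c == label:
            target.append(y_label)
        elif rest > 0.0:
            # Rescale the other classes so the whole vector sums to one.
            target.append(s * (1.0 - y_label) / rest)
        else:
            target.append(0.0)
    return target

# The true class is lifted to 0.9; the other two keep their 3:2 ratio.
target = soften_ground_truth([0.5, 0.3, 0.2], label=0)
```

The motivation for an encoding of this kind is that a discriminator could otherwise separate real from generated label maps simply by checking whether the values are exactly zero or one.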
Scene attribute loss.
We add additional supervision to the network in order to regularize high-level scene attributes such as scene category and overall object distributions. To make the network aware of different scene categories, we add two fully connected layers that predict the room category (over 8 scene categories) of the input panorama from the latent code generated by the encoder. We backpropagate gradients directly through the encoder from the softmax classification loss on the scene category predictions. Furthermore, we add another auxiliary network that computes the pixel-level distribution of different object classes from the semantic prediction, and backpropagates gradients from a loss comparing this distribution to the ground truth distribution. Our ablation studies in Sec.4 demonstrate that these additional losses help to improve the semantic predictions, especially for small objects.
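The object distribution term can be illustrated with a short sketch. It is a simplified stand-in, not the auxiliary network itself: the distribution is taken to be the mean of the per-pixel class probabilities, and the sum of absolute differences is an assumed choice of distance. Both function names are hypothetical.

```python
def class_distribution(prob_map):
    """Average the per-pixel class probabilities over all pixels of an H x W map."""
    num_pixels = sum(len(row) for row in prob_map)
    num_classes = len(prob_map[0][0])
    totals = [0.0] * num_classes
    for row in prob_map:
        for pixel in row:
            for c, p in enumerate(pixel):
                totals[c] += p
    return [t / num_pixels for t in totals]

def distribution_loss(pred_dist, gt_dist):
    """Sum of absolute differences between two class distributions (assumed distance)."""
    return sum(abs(p - g) for p, g in zip(pred_dist, gt_dist))

# A 1 x 2 prediction over two classes, compared with a ground truth of one pixel per class.
pred_dist = class_distribution([[(0.8, 0.2), (0.4, 0.6)]])
gt_dist = class_distribution([[(1.0, 0.0), (0.0, 1.0)]])
loss = distribution_loss(pred_dist, gt_dist)
```

Because the distribution discards pixel positions, this term penalizes predicting too much or too little of a class without caring where in the panorama it is placed.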
In this section, we present a set of experiments to evaluate Im2Pano3D. We not only investigate how well it predicts semantics and structure for unseen parts of a scene, but also study the impact of each algorithmic component through ablation studies. In most of our experiments, we consider the case where the input observation covers a 180° horizontal FoV, resulting in a 50% partial observation (Fig.8). In later experiments, we demonstrate our approach on other camera configurations. All evaluations are performed on unobserved regions only.
For our experiments, we use both synthetic (SUNCG [34]) and real (Matterport3D [6]) datasets. The former is used for pre-training and ablation studies. The latter is used for final evaluation on real data.
SUNCG [34]: This dataset contains synthetically rendered panoramic images with color, depth and semantics of synthetic 3D indoor rooms. In total, we use 58,866 panoramas for training, and 480 for testing.
Matterport3D [6]: This dataset contains real RGB-D panoramas captured with a tripod-mounted Matterport camera. We use the color, depth and semantics provided by the dataset, but re-render them to form our panoramic representation (Sec. 3.1). In total, we use 5,315 panoramas for training, and 480 for testing.
Tables 1 and 2, column headers: models | semantics | 3D surface (m) | normals (°)
4.2 Baseline Methods
To our knowledge, there is currently no prior work that performs our task exactly. To provide baselines for comparison, we consider the following extensions to related work:
Average distribution (avg) computes a per pixel average of all images within the training set.
Average distribution by scene category (avg-type) computes a per pixel average of all training images within the scene category. The prediction is chosen by the testing images’ ground truth scene categories.
Nearest neighbor (nn) retrieves the nearest neighbor image based on ImageNet features, and uses its semantic segmentation and depth map as the prediction.
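For the semantic channel, the average distribution baseline can be sketched as below. This is one interpretation of the description above, with a hypothetical function name and label maps stored as nested lists of class indices.

```python
def average_distribution(train_label_maps, num_classes):
    """Per-pixel class frequencies over training label maps of identical size."""
    height, width = len(train_label_maps[0]), len(train_label_maps[0][0])
    counts = [[[0] * num_classes for _ in range(width)] for _ in range(height)]
    for label_map in train_label_maps:
        for y in range(height):
            for x in range(width):
                counts[y][x][label_map[y][x]] += 1
    n = len(train_label_maps)
    return [[[c / n for c in counts[y][x]] for x in range(width)] for y in range(height)]

# Two 1 x 2 training label maps with two classes.
avg = average_distribution([[[0, 1]], [[0, 0]]], num_classes=2)
```

Under the same interpretation, the avg-type variant would run the same computation separately on the training images of each scene category and pick the result that matches the test image's category.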
Tab.1 and 2 summarize the quantitative results. Models are labeled by their input and output modality acronyms; rgb: color, s: semantics, d: depth, p: plane distance, n: surface normal. For example, model [d2d] takes in a depth map as input and predicts the raw depth values of the unobserved regions. To evaluate the algorithm’s performance independent of segmentation accuracy over the observed regions, for the [pns2pns] models, we assume ground truth segmentation for the observed region as input.
We measure the quality of the predicted 3D geometric structure with the following metrics:
Normal angle: the mean and median angles (in degrees) between the prediction and the ground truth, and the percentage of pixels with error less than three thresholds (11.25°, 22.5°, 30°).
Surface distance: the mean and median L2 distances (in meters) between the final predicted 3D point locations and the ground truth, and the percentage of pixels with error less than three thresholds (0.2m, 0.5m, 1m).
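Both geometric metrics share the same summary statistics, so they can be sketched with one helper. The code below is a minimal illustration with hypothetical function names; it assumes unit-length normals supplied as flat lists of pixels.

```python
import math
from statistics import mean, median

def normal_angle_errors(pred_normals, gt_normals):
    """Angle in degrees between each predicted and ground-truth unit normal."""
    errors = []
    for p, g in zip(pred_normals, gt_normals):
        dot = sum(a * b for a, b in zip(p, g))
        dot = max(-1.0, min(1.0, dot))  # clamp against rounding before acos
        errors.append(math.degrees(math.acos(dot)))
    return errors

def summarize(errors, thresholds):
    """Mean, median, and fraction of pixels with error below each threshold."""
    within = {t: sum(e < t for e in errors) / len(errors) for t in thresholds}
    return mean(errors), median(errors), within

# One perfect pixel and one pixel that is off by 90 degrees.
errors = normal_angle_errors([(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)],
                             [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)])
mean_err, median_err, within = summarize(errors, (11.25, 22.5, 30.0))
```

The surface distance metric would reuse `summarize` with per-pixel L2 distances in meters and the thresholds listed above.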
We measure the quality of the predicted semantics with the following metrics:
Probability over ground truth (PoG): the pixelwise probability prediction of the ground truth labels averaged within each class then averaged across categories.
Class existence (exist): the F1 score of object class existence predictions averaged across all classes (where existence is defined by a threshold of 400 pixels).
Class size (size): the pixel size difference between ground truth and predictions divided by the ground truth size. Evaluated on the object categories with correct existence predictions only.
Earth Mover’s Distance (EMD): the average Earth Mover’s Distance between the predicted and ground truth 3D points for the categories with correct existence prediction. Each 3D point is weighted by its predicted probability, and the probabilities are normalized to sum to one for each category. We use k-center clustering (k=50) to cluster the 3D points before calculating the EMD.
IoU: the intersection over union of the most likely predicted pixel label, averaged across all classes.
Accuracy (acc): the percentage of correctly predicted pixels across all pixels.
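Two of these metrics, PoG and pixel accuracy, can be sketched as follows. The function names are hypothetical, pixels are supplied as flat lists, and skipping classes that do not occur in the ground truth is an assumption of this sketch.

```python
def probability_over_ground_truth(prob_pixels, gt_labels, num_classes):
    """PoG: probability given to the true label, averaged per class, then across classes."""
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for probs, label in zip(prob_pixels, gt_labels):
        sums[label] += probs[label]
        counts[label] += 1
    # Classes absent from the ground truth are skipped (an assumption of this sketch).
    per_class = [s / n for s, n in zip(sums, counts) if n > 0]
    return sum(per_class) / len(per_class)

def pixel_accuracy(prob_pixels, gt_labels):
    """Fraction of pixels whose most likely label matches the ground truth."""
    hits = sum(max(range(len(p)), key=p.__getitem__) == g
               for p, g in zip(prob_pixels, gt_labels))
    return hits / len(gt_labels)

# Three pixels, two classes; the second pixel is misclassified.
probs = [(0.8, 0.2), (0.6, 0.4), (0.3, 0.7)]
labels = [0, 1, 1]
pog = probability_over_ground_truth(probs, labels, num_classes=2)
acc = pixel_accuracy(probs, labels)
```

The misclassified pixel counts as a full error for accuracy but still contributes its 0.4 probability to PoG, which is the sense in which PoG accounts for soft errors.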
The first four metrics of semantic evaluation are newly introduced for this task. Unlike most semantic segmentation tasks, where predictions are made for pixels directly observed with a camera, our task is to predict semantics for large regions of unobserved pixels, which often contain completely unseen objects. For this task, predicting the existence and size of unseen objects is already very difficult and useful for many applications, and thus we include the existence and size metrics, which are invariant to precise object locations. We also introduce metrics based on the predicted probability distribution (PoG and EMD), which account for soft errors in position. We use PoG to rank algorithms in our comparisons.
4.3 Experimental Results
Comparing to Baseline Methods.
Comparing our model [rgbpn2pns (s+m)] to all baseline methods (Tab.2, rows 2-5), our proposed model produces better predictions in terms of both semantics and 3D structure. In particular, compared to the two-step process of predicting semantic labels over predicted color images in the unobserved regions [inpaint], directly predicting semantic labels in a one-step process generates a more accurate result (+13% in PoG and -0.24m in surface distance). Fig.6 shows a qualitative comparison.
Do different surface encodings matter?
Comparing the model using raw depth values [d2d] to the model using the plane equation encoding [pn2pn] (Tab.1 and Fig.4), we can see that the plane equation encoding provides a strong regularization, allowing the network to predict higher quality 3D geometry with lower surface distance and normal error (0.03m and 21° less, respectively).
What are the effects of different losses?
Comparing the model trained with adversarial loss [pns2pns+S+A] and without [pns2pns+S] in Tab.1, we can see that the adversarial loss improves the prediction accuracy for small objects, which is reflected in higher IoU (+2%). Meanwhile the adversarial loss reduces recall for objects with big pixel area, which is reflected in lower total pixel accuracy (-1.2%). Similarly, the scene attribute loss also improves IoU (+2%), with a small compromise on total pixel accuracy (-0.3%).
Does synthetic data help?
Comparing our models [pns2pns] and [rgbpn2pns] trained with and without the SUNCG dataset and tested on the Matterport3D dataset, we observe that pre-training on SUNCG significantly improves the PoG of both models. In particular, when the input is a segmentation map instead of a color image [pns2pns], the model trained only on SUNCG can even achieve better performance than the model trained on Matterport3D alone (+1.3% in PoG and -0.08m in surface distance). This result demonstrates that training on synthetic data is critical for this task, as it enables the network to learn a rich whole-room contextual prior from a large variety of indoor scenes, which is extremely expensive to obtain with real data.
How is accuracy influenced by distance to observation?
Fig.9 (a) shows the average IoU with respect to its distance to the nearest observed pixel. As expected, the performance for Im2Pano3D decreases for pixels that are further from the input observation. However, the performance is still much higher than other baselines when the region is far from the observation or completely behind the camera, yet still not as high as human performance.
How is accuracy influenced by input FoV?
To investigate how the input FoV affects the prediction accuracy, we run the following experiment: we keep the vertical FoV of the input image fixed while steadily increasing the horizontal FoV, and ask the network to predict the structure and semantics for the full panorama. Fig.9 (b) shows the prediction accuracy in the unobserved regions with respect to the input FoV: accuracy improves as the input FoV increases.
Generalizing to different camera configurations.
In most of our evaluations, we consider the case where the input observation has a 180° horizontal FoV. However, in real robotic applications, systems may be equipped with different types of cameras, resulting in different observation FoV patterns. Here we demonstrate how Im2Pano3D can generalize to other cases. The camera configurations we consider include: single or multiple registered RGB-D cameras such as Matterport cameras (Fig.10 (a-d)), a single RGB-D camera capturing a short video sequence (e), a color-only panoramic camera (f), and a color panoramic camera paired with a single depth camera (g). To improve the ability of the network to generalize to different input observation patterns, we use a random view mask during training. Tab.3 shows the quantitative evaluation. For all of these camera configurations, Im2Pano3D provides a unified framework that effectively fills in the missing 3D structure and semantic information of the unobserved scene.
We propose the task of semantic-structure view extrapolation and present Im2Pano3D, a unified framework to produce a complete room structure and semantic estimation conditioned on a partial observation of the scene. Experiments demonstrate that the direct prediction of structure and semantics for the unobserved scene provides more accurate results than alternative approaches. However, while Im2Pano3D explores the possibilities of whole-room contextual reasoning for 3D scene understanding, the proposed system is still far from perfect. Possible future directions may include: explicitly modeling semantics at the instance-level as opposed to category-level, and exploring alternative data representations that consider occluded regions.
This work is supported by Google, Intel, and the NSF (VEC 1539014/1539099). It makes use of data from Matterport3D and Planner5D, and hardware donated by NVIDIA and Intel. Shuran Song is supported by a Facebook Fellowship.
-  http://robotics.stanford.edu/ rubner/emd/default.html.
-  Scene toolkit: https://github.com/smartscenes/stk.
-  M. Bar. Visual objects in context. Nature reviews. Neuroscience, 5(8):617, 2004.
-  C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. Patchmatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph., 28(3):24–1, 2009.
-  J. Borenstein and Y. Koren. Real-time obstacle avoidance for fast mobile robots. IEEE Transactions on Systems, Man, and Cybernetics, 19(5):1179–1187, 1989.
-  A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3D: Learning from RGB-D data in indoor environments. 2017.
-  A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pages 1033–1038. IEEE, 1999.
-  D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in neural information processing systems, pages 2366–2374, 2014.
-  M. Firman, O. Mac Aodha, S. Julier, and G. J. Brostow. Structured prediction of unobserved voxels from a single depth image. 2016.
-  R. Garg, G. Carneiro, and I. Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In European Conference on Computer Vision, pages 740–756. Springer, 2016.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  S. Gupta, R. Girshick, P. Arbelaez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation. 2014.
-  J. Hays and A. A. Efros. Scene completion using millions of photographs. In ACM Transactions on Graphics (TOG), volume 26, page 4. ACM, 2007.
K. He, X. Zhang, S. Ren, and J. Sun.
Deep residual learning for image recognition.
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  D. J. Heeger and J. R. Bergen. Pyramid-based texture analysis/synthesis. In Proceedings of the 22nd annual conference on Computer graphics and interactive techniques, pages 229–238. ACM, 1995.
-  J. Hochberg. Perception (2nd edn), 1978.
-  H. Intraub and M. Richardson. Wide-angle memories of close-up scenes. Journal of Experimental Psychology: Learning, Memory, and Cognition, 15(2):179, 1989.
-  P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
-  D. Jayaraman and K. Grauman. Learning to look around. arXiv preprint arXiv:1709.00507, 2017.
-  A. Khosla, B. An, J. J. Lim, and A. Torralba. Looking beyond the visible scene. In CVPR, Ohio, USA, June 2014.
-  I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 239–248. IEEE, 2016.
-  L.-J. Li, H. Su, L. Fei-Fei, and E. P. Xing. Object bank: A high-level image representation for scene classification & semantic feature sparsification. In Advances in neural information processing systems, pages 1378–1386, 2010.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
-  P. Luc, N. Neverova, C. Couprie, J. Verbeek, and Y. Lecun. Predicting Deeper into the Future of Semantic Segmentation. In ICCV 2017, 2017.
-  K. Lyle and M. Johnson. Importing perceived features into false memories. Memory, 14(2):197–213, 2006.
-  D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. Efros. Context encoders: Feature learning by inpainting. 2016.
-  J. Rock, T. Gupta, J. Thorsen, J. Gwak, D. Shin, and D. Hoiem. Completing 3D object shape from one depth image. 2015.
-  Y. Rubner, C. Tomasi, and L. J. Guibas. A metric for distributions with applications to image databases. In Computer Vision, 1998. Sixth International Conference on, pages 59–66. IEEE, 1998.
-  T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
S. M. Seitz and C. R. Dyer.
Physically-valid view synthesis by image interpolation.In Representation of Visual Scenes, 1995.(In Conjuction with ICCV’95), Proceedings IEEE Workshop on, pages 18–25. IEEE, 1995.
-  S. M. Seitz and C. R. Dyer. View morphing. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 21–30. ACM, 1996.
-  Q. Shan, B. Curless, Y. Furukawa, C. Hernandez, and S. M. Seitz. Photo uncrop. In European Conference on Computer Vision, pages 16–31. Springer, 2014.
-  N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. 2012.
-  S. Song, F. Yu, A. Zeng, A. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. 2017.
-  K. Tateno, F. Tombari, I. Laina, and N. Navab. Cnn-slam: Real-time dense monocular slam with learned depth prediction. arXiv preprint arXiv:1704.03489, 2017.
-  J. Varley, C. DeChant, A. Richardson, A. Nair, J. Ruales, and P. Allen. Shape completion enabled robotic grasping. 2016.
-  K. Wada, K. Okada, and M. Inaba. Fully convolutional object depth prediction for 3d segmentation from 2.5 d input.
-  X. Wang, D. Fouhey, and A. Gupta. Designing deep networks for surface normal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 539–547, 2015.
-  Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. 2015.
-  J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In Computer vision and pattern recognition (CVPR), 2010 IEEE conference on, pages 3485–3492. IEEE, 2010.
-  Y. Zhang, J. Xiao, J. Hays, and P. Tan. Framebreak: Dramatic image extrapolation by guided shift-maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1171–1178, 2013.
-  Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 3357–3364. IEEE, 2017.
Appendix A Additional Algorithm Details
Let C(k,c) denote a Convolution-BatchNorm-ReLU layer with k filters with c channels, and CD denote a Convolution-BatchNorm-Dropout-ReLU layer with a dropout rate of 50%.
Encoder: [ C(16,x)-C(32,16)-C(64,32) ]x4 - C(192,128) - C(128,256) - C(256,256) - C(256,256) - C(256,256) - C(512,256)
Decoder: CD(512,256) - CD(512,256) - CD(512,256) - C(512,256) -C(512,128) - C(256,128) - [C(256,64) - C(128,64) - C(64,x)]x3
Discriminator: [C(x,16)]x3 - C(16,64) - C(64,128) - C(64,512).
The number of channels (x) of the encoder’s first layer and the decoder’s last layer depends on the stream’s modality: for color stream , for normal stream , for plane distance stream , for segmentation stream in encoder, and in decoder and discriminator.
We implement our network architecture in Torch7. We randomly initialize all layers by drawing weights from a Gaussian distribution with mean 0 and standard deviation 1. For the final model, we pretrain the network on the SUNCG dataset for 20 epochs and fine-tune on Matterport3D for 20 epochs, both with batch size 3.
The following equations describe how we weight the different losses for each channel. Note that we add the loss from the PN-layer only after 1000 iterations to avoid unstable gradients:
The combined loss for plane distance:
where , and when training iteration otherwise.
The combined loss for surface normal:
,where , and when training iteration otherwise.
The combined loss for semantics: , where , and .
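The delayed PN-layer loss can be illustrated with a simple weight schedule. This is a minimal sketch rather than the Torch7 implementation: the function names, the term names (`l1`, `adv`, `pn`), and the numeric weights in the usage note are hypothetical, and reading "after 1000 iterations" as a strict inequality is an assumption.

```python
def pn_loss_weight(iteration, full_weight, start_iteration=1000):
    """Weight of the PN-layer loss: zero during warm-up, full_weight afterwards.

    Assumption: the loss is enabled strictly after `start_iteration`.
    """
    return full_weight if iteration > start_iteration else 0.0


def combined_loss(loss_terms, weights, iteration, pn_key="pn"):
    """Weighted sum of loss terms, with the PN-layer term gated by iteration.

    `loss_terms` and `weights` are dicts keyed by hypothetical term names.
    """
    total = 0.0
    for name, value in loss_terms.items():
        weight = weights[name]
        if name == pn_key:
            weight = pn_loss_weight(iteration, weight)
        total += weight * value
    return total
```

For example, with `loss_terms = {"l1": 0.5, "adv": 0.2, "pn": 4.0}` and `weights = {"l1": 1.0, "adv": 1.0, "pn": 0.5}`, the PN-layer term contributes nothing at iteration 500 and contributes `0.5 * 4.0` at iteration 2000.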
The PN-Layer takes in a predicted normal map and plane distance map and calculates the final 3D point location for each pixel. If the pixel normal prediction is (normalized to be unit length), the plane distance prediction is , the camera intrinsics matrix is , the virtual camera center is at , and the 2D pixel location is , then the computed 3D point location is
When is the origin, we can simplify the equation to:
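The geometry of the PN-layer can be sketched as a ray-plane intersection. The sketch below is an illustration under stated assumptions, not the paper's exact formula: it assumes the predicted plane satisfies n · X = p in the same coordinate frame in which the camera center is expressed, that the camera axes are aligned with that frame (no rotation), and that the intrinsics matrix is upper triangular with last row (0, 0, 1). The function and argument names are hypothetical.

```python
def pn_layer_point(normal, plane_dist, K, pixel, center=(0.0, 0.0, 0.0)):
    """Intersect the viewing ray of a pixel with the predicted plane n . X = p.

    normal:     unit normal (nx, ny, nz) predicted for the pixel
    plane_dist: predicted plane distance p (assumed convention: n . X = p)
    K:          3x3 intrinsics as nested lists, assumed upper triangular
    pixel:      2D pixel location (x, y)
    center:     virtual camera center
    """
    x, y = pixel
    fx, skew, cx = K[0]
    _, fy, cy = K[1]
    # Ray direction d = K^{-1} [x, y, 1]^T, solved by back substitution.
    dz = 1.0
    dy = (y - cy) / fy
    dx = (x - cx - skew * dy) / fx
    d = (dx, dy, dz)

    n_dot_d = sum(n * di for n, di in zip(normal, d))
    if abs(n_dot_d) < 1e-12:
        raise ValueError("viewing ray is parallel to the predicted plane")
    n_dot_c = sum(n * ci for n, ci in zip(normal, center))

    # Points on the ray are center + t * d; solve n . (center + t * d) = p for t.
    t = (plane_dist - n_dot_c) / n_dot_d
    return tuple(ci + t * di for ci, di in zip(center, d))
```

When `center` is the origin, the term `n_dot_c` vanishes and the point reduces to `(p / (n . d)) * d`, which matches the kind of simplification mentioned in the text.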
Appendix B Details on evaluation metrics
In this section we provide details on the new evaluation metrics: PoG, size, and EMD. We consider only pixels in the unobserved regions of the panorama for all evaluations.
Probability over ground truth (PoG)
Let be the predicted probability of pixel for class , where and . Let be the collection of all pixels in the ground truth segmentation map with label equal to class and be the total number of pixels in this collection. Then PoG for class is defined as follows:
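One plausible reading of this definition, averaging the predicted probability of a class over the ground-truth pixels of that class, can be sketched as follows. The function name and the data layout (one probability list per pixel, one integer label per pixel) are assumptions made for illustration; the caller is expected to pass only pixels from the unobserved region.

```python
def probability_over_ground_truth(probs, gt_labels, cls):
    """Mean predicted probability of class `cls` over ground-truth pixels of that class.

    probs:     per-pixel probability lists, probs[i][c] for pixel i and class c
    gt_labels: per-pixel ground-truth class indices
    """
    selected = [p[cls] for p, label in zip(probs, gt_labels) if label == cls]
    if not selected:
        return None  # class absent from the ground truth, so the average is undefined
    return sum(selected) / len(selected)
```

Under this reading, a class scores close to 1 only when the network places high probability on it at exactly the pixels where it appears in the ground truth.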
Let be the total number of pixels in the ground truth segmentation map with labels equal to class and be the total number of pixels in the predicted segmentation map with labels equal to class . For each object class where and , the size difference is defined as follows:
Earth Mover’s Distance (EMD):
Computing the EMD is expensive for two arbitrary 2D distributions. So, we approximate the calculation using the implementation from . We first cluster all the pixels in into 50 clusters using the K-center algorithm, , where is the cluster representative (cluster center) and is the weight of the cluster for each cluster. We also cluster all pixels in the predicted probability distribution map with into 50 clusters, , where is the representative of cluster (cluster center) and is the weight of the cluster for each cluster. Then we find the flow among all flows between and that minimizes the overall cost:
subject to the constraints:
The earth mover’s distance is defined as the work normalized by the total flow:
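The clustering step that turns a pixel set into a compact signature of representatives and weights can be sketched as follows. This is a minimal illustration, assuming the greedy farthest-first variant of the K-center algorithm and an arbitrary choice of the first center; the function name and data layout are hypothetical, and the flow optimization itself is not shown.

```python
def k_center_signature(points, weights, k):
    """Greedy farthest-first K-center clustering of weighted 2D points.

    Returns a list of (representative_point, cluster_weight) pairs, where the
    cluster weight is the sum of the weights of the points assigned to it.
    """
    if not points:
        return []
    k = min(k, len(points))

    def sq_dist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    centers = [0]  # indices into `points`; the first center is chosen arbitrarily
    nearest = [sq_dist(p, points[0]) for p in points]
    while len(centers) < k:
        far = max(range(len(points)), key=lambda i: nearest[i])
        if nearest[far] == 0:
            break  # every remaining point coincides with an existing center
        centers.append(far)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], sq_dist(p, points[far]))

    # Assign each point to its closest center and accumulate the weights.
    cluster_weight = [0.0] * len(centers)
    for p, w in zip(points, weights):
        j = min(range(len(centers)), key=lambda c: sq_dist(p, points[centers[c]]))
        cluster_weight[j] += w
    return [(points[c], cw) for c, cw in zip(centers, cluster_weight)]
```

For a ground-truth mask, each pixel could carry weight 1; for a predicted probability map, the pixel's predicted probability is a natural weight. Both choices are assumptions here. The two resulting signatures are then the inputs to the flow problem described above.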
Appendix C Additional Experiment Results
Normal and depth error distribution
Fig. 11 shows a histogram of the 3D surface distance error (L2 distance in meters) on the test set. Most of the errors are within 1 meter of the ground truth. Fig. 12 shows a histogram of the angular error in predicted surface normals on the test set. Most of the normal predictions fall within 0° to 20° of the ground truth.
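The two per-pixel errors summarized in these histograms can be computed as in the sketch below. The function names are hypothetical, and the code only illustrates the standard definitions (angle between two vectors, Euclidean distance between two 3D points), not the evaluation script used for the figures.

```python
import math


def angular_error_deg(n_pred, n_gt):
    """Angle in degrees between a predicted and a ground-truth surface normal."""
    dot = sum(a * b for a, b in zip(n_pred, n_gt))
    norm = math.sqrt(sum(a * a for a in n_pred)) * math.sqrt(sum(b * b for b in n_gt))
    # Clamp to [-1, 1] to guard against rounding just outside the acos domain.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


def surface_distance(p_pred, p_gt):
    """L2 distance between a predicted and a ground-truth 3D point."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p_pred, p_gt)))
```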
Where should the camera look?
As shown in the main paper, observing more of the scene typically leads to higher accuracy in predicting the rest of the scene. However, the observation pattern (defined by the cameras’ locations and their viewing angles) also has a strong impact on prediction accuracy. Comparing [top6] and [middle3] in Tab. 3 of the main paper, we can see that although [top6] has more cameras and view coverage, its prediction accuracy is lower because the cameras are looking at regions of the scene with low information density (e.g., ceilings). Fig. 13 shows the spatial distribution of semantic prediction error. The red regions (indicating areas with higher error) in the lower half of the panorama show that some parts of the scene are harder to predict than others. This error map could help determine camera placement on a domestic robot: orienting the camera toward those regions would reduce overall uncertainty.
Additional Quantitative results
Figs 17 to 17 show some typical prediction results with detailed analysis. Fig.19 shows additional results for short video sequence input from the NYU dataset . Fig.21 shows additional results on the SUNCG dataset . Fig.21 shows additional results on the Matterport3D dataset .
Per-category performance breakdown
Tables 8 and 8 show the per-category performance for the models rgbpn2pns and pns2pns on the Matterport3D dataset, evaluated with PoG, IoU, EMD, class existence, and class size. We find that the network performs well on predicting room structure (i.e., wall, floor, ceiling) and large furniture (e.g., bed, door). Although the network, as expected, finds it challenging to predict smaller objects in precisely the same locations as the ground truth, it still performs well at predicting their existence.
Example human completions
Fig. 18 shows more examples of human completion. In some examples, completion results from different users are quite consistent (e.g., row 4), while in others, different users generate very different completion results. For example, in row 3, two users design their completions assuming that the room is a bedroom, while the other two assume that it is a living room. Interestingly, regardless of which room type a user assumes, all predictions include windows on the walls in an arrangement consistent with the ground truth.