Semantic Scene Completion from a Single Depth Image
This paper focuses on semantic scene completion, a task for producing a complete 3D voxel representation of volumetric occupancy and semantic labels for a scene from a single-view depth map observation. Previous work has considered scene completion and semantic labeling of depth maps separately. However, we observe that these two problems are tightly intertwined. To leverage the coupled nature of these two tasks, we introduce the semantic scene completion network (SSCNet), an end-to-end 3D convolutional network that takes a single depth image as input and simultaneously outputs occupancy and semantic labels for all voxels in the camera view frustum. Our network uses a dilation-based 3D context module to efficiently expand the receptive field and enable 3D context learning. To train our network, we construct SUNCG - a manually created large-scale dataset of synthetic 3D scenes with dense volumetric annotations. Our experiments demonstrate that the joint model outperforms methods addressing each task in isolation and outperforms alternative approaches on the semantic scene completion task.
We live in a 3D world where empty and occupied space is determined by the physical presence of objects. To successfully navigate within and interact with the world, we rely on an understanding of both the 3D geometry and the semantics of the environment. Similarly, for a robot, the ability to infer complete 3D shape from partial observations is necessary for low-level tasks such as grasping and obstacle avoidance, while the ability to infer the semantic meaning of objects in the scene enables high-level tasks such as retrieval of objects.
With this motivation, our goal is to have a model that predicts both volumetric occupancy (i.e., scene completion) and object category (i.e., scene labeling) from a single depth image of a 3D scene; in this paper we refer to this task as semantic scene completion (Figure 1). Prior work addresses only part of this problem, as shown in Figure 2: RGB-D segmentation approaches consider only visible surfaces without the full 3D shape [6, 26], while shape completion approaches consider only geometry without semantics or a single object out of context [32, 34].
Our key observation is that the occupancy patterns of the environment and the semantic labels of the objects are tightly intertwined. Therefore, the two problems of predicting voxel occupancy and identifying object semantics are strongly coupled. In other words, knowing the identity of an object helps us predict what areas of the scene it is likely to occupy without direct observation (e.g., seeing the top of a chair behind a table and inferring the presence of a seat and legs). Likewise, having an accurate occupancy pattern for an object helps us recognize its semantic class.
To leverage the coupled nature of the two tasks we jointly train a deep neural network using supervision targeted at both tasks. Given a single-view depth map as input, our semantic scene completion network (SSCNet) produces one of N+1 labels for all voxels in the view frustum. Each voxel is labeled as occupied by one of N object categories or free space. Most critically, this prediction extends beyond the projected surface implied by the depth map, thus providing occupancy information for the entire scene.
To achieve this goal there are several issues that must be addressed. First, how do we effectively capture contextual information from 3D volumetric data, where the signal is sparse and lacks high frequency detail? Second, since existing RGB-D datasets only provide annotations on visible surfaces, how do we obtain training data with complete volumetric annotations at scene level?
To address the first issue, we design a 3D dilation-based context module that efficiently expands our network’s receptive field to model the contextual information. We find that a big receptive field is crucial for the task. As demonstrated in Figure 2, looking at the small region of a chair in isolation, it is hard to recognize and complete the chair. However, if we consider the context due to surrounding objects, such as the table and floor, the problem is much easier.
To address the second issue, we construct SUNCG, a large-scale synthetic 3D scene dataset with more than 45,622 indoor environments designed by people. All the 3D scenes are composed of individually labeled 3D object meshes, from which we can compute 3D scene volumes with dense object labels through voxelization.
Our experiments with these solutions demonstrate that a method that jointly predicts volumetric occupancy and object semantics can outperform methods addressing each task in isolation. Both the 3D context model learned by our network and the large-scale synthetic training data help to improve performance significantly.
Our main contribution is to formulate an end-to-end 3D ConvNet model (SSCNet) for the joint task of volumetric scene completion and semantic labeling. In support of that goal, we design a dilation-based 3D context module that enables efficient context learning with large receptive fields. To provide the training data for our network, we introduce SUNCG, a manually created large-scale dataset of synthetic 3D scenes with dense occupancy and semantic annotations.
We review related work on RGB-D segmentation, 3D shape completion, and voxel space semantic labeling.
Many prior works focus on RGB-D image segmentation [6, 26, 29, 15]. However, those methods focus on obtaining semantic labels for only the observed pixels without considering the full shape of the object, and hence cannot directly perform scene completion or predict labels beyond the visible surface.
Other prior works focus on single object shape completion [33, 28, 34, 32]. To apply those methods to scenes, additional segmentation or object masks would be required. For scene completion, when the missing regions are relatively small, methods using plane fitting or object symmetry [13, 19] can be applied to fill in holes. However, these methods rely heavily on the regularity of the geometry and often fail when the missing regions are large. Firman et al. show promising completion results on scenes. However, their approach is based purely on geometry without semantics, and thus it produces less accurate results when the scene structure becomes complex.
One possible approach to obtain the complete geometry and semantic labels for a scene is to retrieve and fit instance-level 3D mesh models to the observed depth map [7, 30, 4, 16, 23, 17, 14]. However, the prediction quality of this type of approach is limited by the quality and variety of 3D models available for retrieval. Naturally, observed objects that cannot be explained by the available models tend to be missed. Or, if the 3D model library is large enough to include all observations, then a difficult retrieval and alignment problem must be solved. Alternatively, it is possible to use 3D primitives such as bounding boxes to approximate the 3D geometry of objects [11, 18, 31]. However, the bounding box approximation limits the geometric detail of the output predictions.
Another line of work completes and labels 3D scenes, but with separate modules for feature extraction and context modeling. Zheng et al. predict the unobserved voxels by physical reasoning. Kim et al. train a Voxel-CRF model from labeled floor plans to optimize the semantic labeling and reconstruction for indoor scenes. Hane et al. and Blaha et al. use joint optimization for multi-view reconstruction and segmentation for outdoor scenes. However, this line of work uses predefined features and separates the feature learning from the context modeling, and it is expensive for CRF-based models to encode long-range contextual information. In contrast, our model is able to jointly learn the low-level feature representation and high-level contextual information end-to-end from large-scale 3D scene data, directly modeling long-range contextual cues through a large receptive field.
Our paper leverages data generated from a large-scale synthetic 3D scene dataset. Although recent works have focused on generating segmentation labels for 2D images by rendering synthetic scenes [8, 27], the 3D aspect of such data has not been fully utilized. Existing datasets focus either on objects [2, 34] or on a small number of rooms (e.g., 57 rooms). In contrast, our dataset is several orders of magnitude larger than existing 3D scene datasets (45,622 houses with 775,574 rooms), providing a diverse set of furniture arrangements manually created by people.
Given a single-view depth map observation of a 3D scene, the goal of our semantic scene completion network is to map the voxels in the view frustum to one of the class labels $C = \{c_0, \dots, c_N\}$, where $N$ is the number of object classes and $c_0$ represents empty voxels. During training, we render depth maps from virtual viewpoints of our synthetic 3D scenes and voxelize the full 3D scenes with object labels as ground truth. During testing, the observed depth images come from an RGB-D camera.
An overview of our processing pipeline is as follows. We take a single depth map as input and encode it as a 3D volume. This 3D volume is then fed into a 3D convolutional network, which extracts and aggregates both local geometric and contextual information. The network produces the probability distribution of voxel occupancy and object categories for all voxels inside the camera view frustum.
The first issue we need to address is how to encode the observed depth as input to the network. For the semantic scene completion task, the ideal encoding should directly represent the 2D observation in the same 3D physical space as the output, in a way that is invariant to the viewpoint projection, and provide a meaningful signal for the network to learn geometry and scene representation. To this end, we adopt the Truncated Signed Distance Function (TSDF) to encode the 3D space, where every voxel stores the distance value to its closest surface, and the sign of the value indicates whether the voxel is in free space or in occluded space. To better suit our task, we make the following modifications to the standard TSDF.
Most RGB-D reconstruction pipelines speed up the TSDF computation by using the projective TSDF, which finds the closest surface points only in the line of sight of the camera. This projective TSDF is fast to compute, but is inherently view-dependent. Instead, we choose to compute the distance to the closest point anywhere on the full observed surface.
Another issue with TSDF is that strong gradients occur in the empty space along the occlusion boundary between $d_{\max}$ and $-d_{\max}$, where $d_{\max}$ is the truncation value. It is possible to eliminate this gradient by removing the sign; however, the sign is important for the completion task since it indicates the occluded regions of the scene that need to be completed. To solve this problem we flip the TSDF value $d$ as follows: $d_{\text{flipped}} = \operatorname{sign}(d)\,(d_{\max} - |d|)$. This flipped TSDF has its strong gradient on the surface, providing a more meaningful signal for the network to learn better geometric features. The different encodings are visualized in Figure 5, and Table 3 shows their impact on performance.
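The flipping step can be sketched in a few lines. The sketch assumes the formulation $d_{\text{flipped}} = \operatorname{sign}(d)\,(d_{\max} - |d|)$, with $d$ the signed distance and $d_{\max}$ the truncation value; the function name `flip_tsdf` and the numeric values in the usage note are illustrative assumptions, not values taken from the text.

```python
import math


def flip_tsdf(d, d_max):
    """Flip a truncated signed distance so its magnitude peaks at the surface.

    d     : signed distance to the closest surface (the sign separates
            free space from occluded space; which is which is a convention)
    d_max : truncation distance, assumed to be positive
    """
    # Clamp to the truncation band first, as a standard TSDF would.
    d = max(-d_max, min(d_max, d))
    if d == 0.0:
        # Exactly on the surface: sign(0) is 0, so the flipped value is 0.
        return 0.0
    # Near the surface |d| is small, so the flipped magnitude is close to d_max;
    # at the truncation boundary the flipped value falls to 0.
    return math.copysign(d_max - abs(d), d)
```

With a hypothetical truncation of 0.24, a voxel 0.06 in front of the surface maps to roughly 0.18 and a voxel at or beyond the truncation distance maps to 0, so the large jump in value now sits on the surface rather than along the occlusion boundary.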
The network architecture of SSCNet is shown in Figure 3. Taking a high-resolution 3D volume as input, the network first uses several 3D convolution layers to learn a local geometry representation. We use convolution layers with stride and pooling layers to reduce the resolution to one fourth of the original input. Then, we use a dilation-based 3D context module to capture higher-level inter-object contextual information. After that, the network responses from different scales are concatenated and fed into two more convolution layers to aggregate information from multiple scales. At the end, a voxel-wise softmax layer is used to predict the final voxel label. Several shortcut connections are added for better gradient propagation. In implementing this architecture, we made the following design decisions:
Given a 3D scene, we rotate it to align with gravity and room orientation based on the Manhattan assumption. We consider a 3D space of fixed horizontal, vertical, and depth extent. We encode the 3D scene into a flipped TSDF with a fixed grid size and truncation value, and the resulting volume is the network input.
Context can provide valuable information for understanding the scene, as demonstrated by much prior work in image segmentation. In the 3D domain, context is more useful due to a lack of high frequency signals compared to image textures. For example, tabletops, beds, and floors are all geometrically similar to flat horizontal surfaces, so it is hard to distinguish them given only local geometry. However, the relative positions of objects in the scene are a powerful discriminatory signal. To learn this contextual information, we need to make sure our network has a big enough receptive field. To this end, we extend the dilated convolution presented by Yu and Koltun to 3D. Dilated convolution extends normal convolution by adding a step size when the convolution extracts values from the input before convolving with the kernel. Thus we can exponentially expand the receptive field without a loss of resolution or coverage, while still using the same number of parameters. Figure 4 compares the receptive field size of SSCNet with 3D ConvNet architectures from prior work.
Different object categories have very different physical 3D sizes. This implies that the network will need to capture information at different scales in order to recognize objects reliably. For example, we need more local information to recognize smaller objects like TVs, while we need more global information to recognize bigger objects like beds. In order to aggregate information at different scales we add a layer that concatenates the network responses with different receptive fields. We then feed this combined feature map into two convolution layers, which allows us to propagate information across responses from different scales.
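To make the receptive-field argument concrete, the following sketch computes the receptive field along one axis for a stack of convolution layers. The layer configurations in the usage note are illustrative assumptions and are not the actual SSCNet layers; `receptive_field` is a hypothetical helper name.

```python
def receptive_field(layers):
    """Receptive field (in input voxels, along one axis) of stacked convolutions.

    layers: list of (kernel_size, stride, dilation) tuples, input side first.
    """
    rf = 1    # a single output voxel sees one input voxel before any layer
    jump = 1  # spacing, in input voxels, between adjacent outputs so far
    for kernel, stride, dilation in layers:
        # A dilated kernel spans (kernel - 1) * dilation steps of the current grid.
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf
```

Three 3-wide layers with dilation 1 give `receptive_field([(3, 1, 1)] * 3) == 7`, whereas the same three layers with dilations 1, 2, and 4 give 15. The number of kernel weights is identical in both cases, which is the property the paragraph above relies on.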
Due to the sparsity of 3D data, the ratio of empty vs. occupied voxels is around 9:1. To deal with this imbalanced data distribution, we sample the training so that each mini-batch has a balanced set of empty and occupied examples. For each training volume containing occupied voxels, we randomly sample empty voxels from occluded regions for training. Voxels in free space, outside the field of view, or outside the room are ignored.
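A minimal sketch of this balancing step is given below. The number of empty voxels drawn per occupied voxel is not recoverable from the text, so it is exposed as the hypothetical parameter `ratio`; labelling empty voxels with 0 and the function name are likewise assumptions made for illustration.

```python
import random


def sample_training_weights(labels, occluded, ratio, rng=None):
    """Return a 0/1 training weight for every voxel of one volume.

    labels   : list of ints, 0 = empty, > 0 = object category (assumed encoding)
    occluded : list of bools, True where the voxel lies in an occluded region
    ratio    : empty voxels to draw per occupied voxel (hypothetical parameter)
    """
    rng = rng or random.Random(0)
    occupied = [i for i, c in enumerate(labels) if c != 0]
    # Only empty voxels in occluded regions are candidates; empty voxels in
    # free space stay at weight 0 and are therefore ignored by the loss.
    candidates = [i for i, c in enumerate(labels) if c == 0 and occluded[i]]
    n_empty = min(len(candidates), int(ratio * len(occupied)))
    chosen = rng.sample(candidates, n_empty)
    weights = [0] * len(labels)
    for i in occupied + chosen:
        weights[i] = 1
    return weights
```

Voxels outside the field of view or outside the room would need an additional mask; the sketch omits it for brevity.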
The loss function of the network is the sum of voxel-wise softmax losses, $L(p, y) = \sum_{i,j,k} w_{ijk} L_{sm}(p_{ijk}, y_{ijk})$, where $L_{sm}$ is the softmax loss, $y_{ijk}$ is the ground truth label, and $p_{ijk}$ is the predicted probability of the voxel at coordinates $(i, j, k)$ over the $N+1$ classes, where $N$ is the number of object categories and empty voxels are labeled as class $0$. The weight $w_{ijk}$ is equal to zero or one based on the sampling algorithm described above.
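The weighted sum of voxel-wise softmax losses can be sketched as follows. The sketch takes unnormalised class scores per voxel and applies the softmax internally, which is an implementation choice made here rather than something stated in the text; the function names are hypothetical.

```python
import math


def softmax_loss(scores, label):
    """Cross-entropy of one voxel: -log(softmax(scores)[label])."""
    m = max(scores)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[label]


def weighted_voxel_loss(scores, labels, weights):
    """Sum of per-voxel softmax losses, gated by 0/1 sampling weights.

    scores  : per-voxel lists of class scores (one score per class)
    labels  : per-voxel ground truth class indices
    weights : per-voxel 0/1 weights from the sampling step
    """
    return sum(
        w * softmax_loss(s, y)
        for s, y, w in zip(scores, labels, weights)
        if w  # voxels with weight 0 contribute nothing
    )
```

For two voxels with scores `[0.0, 0.0]` and `[5.0, 1.0]`, labels `[0, 1]`, and weights `[1, 0]`, only the first voxel counts and the loss equals `math.log(2)`.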
We implement our network architecture in Caffe. Pre-training SSCNet on the SUNCG training set takes around a week on a Tesla K40 GPU, and fine-tuning on the NYU dataset takes 30 hours. During training, each mini-batch contains one 3D view volume, a limit imposed by GPU memory. To obtain more stable gradient estimates, we accumulate gradients over four iterations and update the weights once afterwards.
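Gradient accumulation of this kind can be illustrated with a toy scalar parameter. Whether the accumulated gradient is summed or averaged before the update is not stated in the text; the sketch averages, which is an assumption, and `train_with_accumulation` and `grad_fn` are hypothetical names.

```python
def train_with_accumulation(w, samples, grad_fn, lr, accumulate=4):
    """Plain gradient descent that updates once every `accumulate` samples.

    w        : initial scalar parameter
    samples  : iterable of training samples (one per mini-batch of size one)
    grad_fn  : callable (w, sample) -> gradient of the loss with respect to w
    lr       : learning rate
    """
    acc, count = 0.0, 0
    for sample in samples:
        acc += grad_fn(w, sample)  # accumulate without touching the weights
        count += 1
        if count == accumulate:
            w -= lr * acc / accumulate  # one update from the averaged gradient
            acc, count = 0.0, 0
    # A trailing group with fewer than `accumulate` samples is discarded.
    return w
```

With the quadratic loss `(w - x) ** 2`, whose gradient is `2 * (w - x)`, four samples equal to 1, a start at `w = 0`, and `lr = 0.5`, the single update moves `w` to 1.0.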
One of the main obstacles of training deep networks for scene-level dense 3D predictions is the lack of large annotated datasets with dense object semantic annotations at the voxel level. Existing RGB-D datasets with surface reconstructions are subject to occlusions or partial observations, and cannot provide the volumetric occupancy and semantic labels for the entire space at the voxel level. To obtain volumetric occupancy ground truth, Firman et al. collect a tabletop dataset with reconstructed RGB-D video using KinectFusion. However, this data does not provide semantic labels, and only contains simple tabletop scenarios. In this paper, we present a new large-scale synthetic 3D scene dataset, from which we obtain a large amount of training data with synthetically rendered depth images and volumetric ground truth.
Our SUNCG dataset contains 45,622 different scenes with realistic room and furniture layouts that are manually created through the Planner5D platform. Planner5D is an online interior design interface that allows users to create multi-floor room layouts, add furniture from an object library, and arrange them in the rooms. After removing duplicated and empty scenes, we ensured the quality of the data with a simple Mechanical Turk cleaning task. During the task, we show a set of top-view renderings of each floor and ask turkers to vote on whether it is a valid apartment floor. We collect three votes for each floor, and consider a floor valid when it has at least two positive votes. In the end, we retain the valid floors, which contain rooms and object instances drawn from a library of unique object meshes covering multiple categories. We manually labeled all objects in the library to assign category labels. Figure 6 shows example scenes from the resulting SUNCG dataset. More information can be found in the appendix.
|scene completion||semantic scene completion|
|Lin et al. (NYU) ||58.5||49.9||36.4||0||11.7||13.3||14.1||9.4||29||24||6.0||7.0||16.2||1.1||12.0|
|Geiger and Wang (NYU) ||65.7||58||44.4||10.2||62.5||19.1||5.8||8.5||40.6||27.7||7.0||6.0||22.6||5.9||19.6|
To generate synthetic depth maps that mimic a typical image capturing process, we use a set of simple heuristics to pick camera viewpoints. Given a 3D scene, we start with a uniform grid of locations on the floor, spaced at regular intervals and not occupied by objects. We then choose camera poses based on the distribution of the NYU-Depth v2 dataset.
The camera height and the camera tilt angle are each sampled from a Gaussian distribution. Then, we render the depth map using the intrinsics and resolution of the Kinect. After that we use a set of simple heuristics to exclude bad viewpoints. Specifically, a rendered view is considered valid if it satisfies the following three criteria: a) the valid depth area (pixels whose depth values fall within a fixed range) covers more than a minimum fraction of the image area, b) there are more than two object categories apart from wall, ceiling, and floor, and c) the object area apart from wall, ceiling, and floor covers more than a minimum fraction of the image area. To reduce data redundancy, we pick at most five images from each room. The resulting valid views are used for training our SSCNet.
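The three validity criteria can be expressed as a small filter. The depth range and the two area fractions are not recoverable from the text, so they appear as the hypothetical parameters `d_min`, `d_max`, `min_depth_frac`, and `min_object_frac`; representing a view as flat per-pixel lists is also an assumption made for brevity.

```python
def is_valid_view(depth, category, d_min, d_max, min_depth_frac, min_object_frac,
                  structure=("wall", "ceiling", "floor")):
    """Apply the three view-validity criteria to one rendered view.

    depth    : flat list of per-pixel depth values
    category : flat list of per-pixel category names (None where nothing is hit)
    """
    n = len(depth)
    # a) enough pixels with depth inside the valid range
    valid_depth = sum(1 for d in depth if d_min <= d <= d_max)
    # Pixels that belong to objects rather than to the room structure
    object_pixels = [c for c in category if c is not None and c not in structure]
    return (
        valid_depth > min_depth_frac * n
        and len(set(object_pixels)) > 2               # b) more than two object categories
        and len(object_pixels) > min_object_frac * n  # c) enough object area
    )
```

A caller would render a view, flatten its depth and category images, and keep the view only when the function returns `True`.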
Since the 3D scenes in the SUNCG dataset consist of a limited number of object instances, we speed up the voxelization process by first voxelizing each individual object in the library and then transforming the labels based on each scene configuration and viewpoint. Specifically, we first voxelize each object to a voxel grid. We set the object voxel size so that the grid tightly fits the largest dimension of the object bounding box. Thus, the object voxel size varies between objects due to the difference in object dimensions. We use the binvox voxelizer, which accounts for both surface and interior voxels by using a space carving approach.
Given a camera view, we define a scene voxel grid in world coordinates with a fixed scene voxel size. Then for each object in the scene, we transform the object voxel grid by translating, rotating, and scaling it according to the object's transformation. We then iterate over each voxel in the scene voxel grid that is inside the transformed object bounding box, and calculate the distance to the nearest neighbor object voxel. If the distance is smaller than the object voxel size, this scene voxel will be labeled with this object category. Similarly, we label all voxels in the scene that belong to walls, floors, and ceilings by treating them as planes with thickness equal to one scene voxel size. All remaining voxels are marked as empty space, therefore providing a fully labeled voxel grid for the camera view.
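The nearest-neighbor labelling step can be sketched as below. For simplicity the sketch searches all scene voxels by brute force instead of restricting the loop to the transformed bounding box, takes the object transformation as an arbitrary callable, and assumes the object voxel size is already expressed in world units; all names are hypothetical.

```python
import math


def label_scene_voxels(scene_centers, object_voxels, transform,
                       object_voxel_size, category, labels=None):
    """Label scene voxels that lie close to a transformed object voxel.

    scene_centers     : list of (x, y, z) scene voxel centers in world coordinates
    object_voxels     : list of (x, y, z) occupied voxel centers in object coordinates
    transform         : callable mapping object coordinates to world coordinates
    object_voxel_size : distance threshold, assumed to be in world units
    labels            : optional dict of existing labels, copied and extended
    """
    labels = dict(labels or {})
    world = [transform(p) for p in object_voxels]
    if not world:
        return labels
    for idx, center in enumerate(scene_centers):
        # Distance from this scene voxel to the nearest object voxel
        nearest = min(math.dist(center, q) for q in world)
        if nearest < object_voxel_size:
            labels[idx] = category
    return labels
```

Calling the function once per object in the scene, passing the previous result as `labels`, builds up the labelled grid; scene voxels that never receive a label correspond to empty space.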
In this section, we evaluate our proposed methods with a comparison to alternative approaches and an ablation study to better understand the proposed model. We evaluate our algorithm on both real and synthetic datasets.
For the real data, we use the NYU dataset, which contains 1449 depth maps captured from Kinect. We obtain the ground truth by voxelizing the 3D mesh annotations from Guo et al., mapping object categories based on Handa et al. The annotations consist of 33 object meshes in 7 categories, with other categories approximated using 3D boxes or planes. In some cases, the mesh annotation is not perfectly aligned with depth due to labeling error and the limited set of meshes. To deal with this misalignment, Firman et al. propose to use rendered depth maps from the annotation for testing. However, by rendering the overly simplified meshes, geometric detail is lost, especially in cases where objects are represented as a box. Therefore, we test with both rendered depth maps and the originals.
For synthetic data, we created a test set from SUNCG which has objects with detailed geometry, and for which the depth map and ground truth volumes are perfectly aligned. The SUNCG test set consists of 500 depth images rendered from 184 scenes that are not in the training set.
As our evaluation metric, we use the voxel-level intersection over union (IoU) of predicted voxel labels compared to ground truth labels. For the semantic scene completion task, we evaluate the IoU of each object class on both the observed and occluded voxels. For the scene completion task, we treat all non-empty object classes as one category and evaluate IoU of the binary predictions on occluded voxels. Following Firman et al., we do not evaluate on voxels outside the view or the room.
In Table 1 we compare on the semantic scene completion task with approaches from Lin et al. and Geiger and Wang. Both these algorithms take an RGB-D frame as input and produce object labels in the 3D scene. Lin et al. use 3D bounding boxes and planes to approximate all objects. Geiger and Wang retrieve and fit 3D mesh models to the observed depth map at test time. The mesh model library used for retrieval is a superset of the models used for ground truth annotations. Therefore, they can achieve perfect alignments by finding the exact mesh model in a small database. In contrast, our algorithm is based on only depth and does not use additional mesh models at test time. Despite this data disparity, our network produces more accurate voxel-level predictions (30.5% vs. 19.6%). An example of the difference is shown in the third row of Figure 7: both Lin et al. and Geiger and Wang's approaches mistake the sofa for a bed, while our network correctly recognizes the sofa. Moreover, since our method does not require the model fitting step, it is much faster at 7s compared to 127s per image.
Previous work has shown scene completion is possible without semantic understanding. We examine to what extent the supervision of object semantics benefits the scene completion task. To do this, we trained a model predicting the occupancy of each voxel by doing binary classification on each voxel (“empty” vs. “occupied”). We compare the performance of models trained with occupancy and multi-class labeling (see Table 2 [completion] vs. [joint]). We also compare with Firman et al. and Zheng et al., which both predict binary voxel occupancy based on a single depth map without semantic understanding of the scene. We use the re-implementation of Zheng et al.'s approach from Firman et al., which only provides the completion result. We evaluate on the rendered NYU benchmark with the same test images used by Firman et al. (200 images randomly picked from the full test set). While Firman et al. produce good results for many cases, their approach fails when the scene becomes complex. For instance, their algorithm fails to complete half of the bed in the first row of Figure 7, and also fails to complete the chairs in the fifth row. In contrast, SSCNet is able to better complete the geometry by leveraging the semantics of the 3D context. This result validates the idea that it is beneficial to understand object semantics in order to achieve better scene completion.
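Both metrics reduce to counting intersections and unions over voxels. The sketch below assumes labels are stored as flat lists with 0 for empty voxels, and uses an optional boolean mask to restrict the evaluation (for example, to occluded voxels inside the view and the room); the function names are hypothetical.

```python
def voxel_iou(pred, truth, cls, mask=None):
    """IoU of one class between predicted and ground truth voxel labels."""
    inter = union = 0
    for i, (p, t) in enumerate(zip(pred, truth)):
        if mask is not None and not mask[i]:
            continue  # voxel excluded from the evaluation
        p_hit, t_hit = (p == cls), (t == cls)
        inter += p_hit and t_hit
        union += p_hit or t_hit
    return inter / union if union else float("nan")


def completion_iou(pred, truth, occluded):
    """Scene completion IoU: all non-empty classes merged, occluded voxels only."""
    def to_binary(labels):
        return [int(c != 0) for c in labels]
    return voxel_iou(to_binary(pred), to_binary(truth), 1, mask=occluded)
```

For `pred = [0, 1, 1, 2]` and `truth = [0, 1, 2, 2]`, the IoU of class 1 is 0.5 and the IoU of class 2 is 0.5, while the completion IoU over all four voxels is 1.0 because every occupied voxel is predicted as occupied.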
To test whether scene completion helps object recognition, we trained a model with a loss accounting only for semantic labels evaluated on the visible surface and compared it with the model trained jointly with labeling and completion (see Table 3 [no completion] vs. [joint]). Even when evaluated only on the visible surface, the model trained with the added supervision of the scene completion task outperforms the model trained only on surface labeling. This demonstrates that understanding complete object geometry and the 3D context is beneficial for recognizing objects.
To investigate the effect of using synthetic training data, we compared models trained only with NYU and models pre-trained on SUNCG and then fine-tuned on NYU (see Tables 1 and 2 NYU vs. NYU+SUNCG). We see a performance gain from the additional synthetic data, especially for the semantic scene completion task, where IoU improves.
In Table 3, the networks labeled [Basic] and [Basic+D] have the same number of parameters, while in [Basic+D] three convolution layers are replaced by dilated convolution, increasing the receptive field. Increasing the receptive field gives the network an opportunity to capture richer contextual information and significantly improves the network performance. To visualize the contextual information learned by the network, we perform the following experiment: given a depth map of a single object, we predict labels for all unobserved voxels. Figure 8 shows the input depth and the predictions. Even without observing depth information for other objects, SSCNet hallucinates plausible contextual objects based on the observed object.
Comparing the network performance with and without the aggregation layer (see Table 3 [Basic+D] vs. [Basic+D+M]), we observe that the model with aggregation yields higher IoU for both the scene completion and semantic scene completion tasks.
The last three rows in Table 3 compare different volumetric encodings: projective TSDF [proj.], accurate TSDF [tsdf], and flipped TSDF [flipped]. We observe that removing the view dependency by using the accurate TSDF improves IoU. Concentrating the gradients on the surface with the flipped TSDF also leads to an improvement.
To balance the empty and occupied voxel examples, we proposed to sample the empty voxels during training. In Table 3, [no balancing] shows the performance when we remove the sampling process during training, where we see a drop in IoU.
Firstly, we do not use any color information, so objects missing depth such as “windows” are hard to handle. This also leads to confusion between objects with similar geometry or functionality. For example, in the second row of Figure 7 the network predicts the desk as the broader furniture category. Secondly, due to GPU memory constraints, our network output resolution is lower than that of the input volume. This results in less detailed geometry and missing small objects, such as the missed objects on the desk in the second row of Figure 7.
In this paper, we introduced SSCNet, a 3D ConvNet for the semantic scene completion task of jointly predicting volumetric occupancy and semantic labels for full 3D scenes. We trained this network on a new large-scale synthetic 3D scene dataset. Experimental results demonstrate that our joint model outperforms methods addressing either component task in isolation, and that by leveraging the 3D contextual information and the synthetic training data, we significantly outperform alternative approaches on the semantic scene completion task.
This work is supported by Intel, Adobe, and NSF (IIS-1251217 and VEC 1539014/ 1539099). It makes use of data from Planner5D and hardware donated by NVIDIA and Intel. Shuran Song is supported by a Facebook Fellowship. Helpful advice was provided by Michael Firman, Yinda Zhang, Matthias Niessner and Angela Dai.
In this section, we present several statistics related to our SUNCG dataset. We start by providing the basic statistics of scene structure and physical size for 3D scenes in our dataset, and then move on to talk about higher-level statistics regarding object categories, room types, and object-room relationships.
Figure 9 illustrates the distribution of the number of rooms and number of floors per scene in the SUNCG dataset. The 3D scenes in our dataset range from single-room studios to multi-floor houses. The average and median number of rooms per house are 8.9 and 7, respectively. The average and median number of floors per house are 1.3 and 1, respectively.
All object meshes and 3D scenes in the SUNCG dataset are measured in real-world spatial dimensions (units are in meters). Figure 10 shows statistics related to physical size over three levels: rooms, floors and houses.
Figure 11 shows the room type distribution and several example rooms per type from our dataset. In total, we have 24 room types that are labeled by the user during creation. These labels include: living room, kitchen, bedroom, child room, dining room, bathroom, toilet, hall, hallway, office, guest room, wardrobe, room, lobby, storage, boiler room, balcony, loggia, terrace, entryway, passenger elevator, freight elevator, aeration, and garage. The four most common room types in our dataset are bedroom, living room, kitchen and toilet, which agrees with the distribution in real-world living spaces.
Figure 14 shows overall object category occurrence in the SUNCG dataset. Figure 14 shows examples of object models from the object library, which contains a diverse set of common furniture and objects for common living spaces. Furthermore, during the creation of the 3D scenes, users have the flexibility to reshape, resize, and re-apply texture to objects to better fit the room style, which further improves the dataset diversity.
With complete object and room type annotations, we can further study the object-room relationships in our dataset. Figure 12 (a) shows the distribution of number of objects per room. Figure 12 (b) shows the distribution of object categories conditioned on different room types. On average there are more than 14 objects in each room. The occurrence and arrangements of these objects in rooms provide rich contextual information that we can learn from.
German Conference on Pattern Recognition (GCPR), 2015.
Proceedings of the IEEE International Conference on Computer Vision, pages 1425–1432, 2013.
VoxNet: A 3D convolutional neural network for real-time object recognition. In IROS, 2015.
A search-classify approach for cluttered indoor scene understanding. ACM Trans. Graph. (Proc. SIGGRAPH Asia), 31(6), 2012.