Complete 3D Scene Parsing from Single RGBD Image

10/25/2017, by Chuhang Zou et al.

Inferring the location, shape, and class of each object in a single image is an important task in computer vision. In this paper, we aim to predict the full 3D parse of both visible and occluded portions of the scene from one RGBD image. We parse the scene by modeling objects as detailed CAD models with class labels and layouts as 3D planes. Such an interpretation is useful for visual reasoning and robotics, but difficult to produce due to the high degree of occlusion and the diversity of object classes. We follow recent approaches that retrieve shape candidates for each RGBD region proposal, then transfer and align the associated 3D models to compose a scene consistent with observations. We propose to use support inference to aid interpretation and a retrieval scheme that uses convolutional neural networks (CNNs) to classify regions and retrieve objects with similar shapes. We demonstrate the performance of our method compared with the state-of-the-art on our new NYUd v2 dataset annotations, which are semi-automatically labelled with detailed 3D shapes for all objects.


1 Introduction

In this paper, we aim to predict the complete 3D models of indoor objects and layout surfaces from single RGBD images. The prediction is represented by detailed CAD objects with class labels and 3D planes of layouts. This interpretation is useful for robotics, graphics, and human activity interpretation, but difficult to produce due to the high degree of occlusion and the diversity of object classes.

Our approach.

We propose an approach to recover a 3D model of room layout and objects from an RGBD image. A major challenge is how to cope with the huge diversity of layouts and objects. Rather than restricting to a parametric model and a few detectable object classes, as in previous single-view reconstruction work, our models represent every layout surface and object with a 3D mesh that approximates the original depth image under projection. We take a data-driven approach that proposes a set of potential object regions, matches each region to a similar region in training images, and transfers and aligns the associated labeled 3D models while encouraging their agreement with observations. During the matching step, we use CNNs to retrieve objects of similar class and shape and further incorporate support estimation to aid interpretation. We hypothesize, and confirm in experiments, that support height information helps most for interpreting occluded objects, because the full extent of an occluded object can be inferred from its support height. The subset of proposed 3D objects and layouts that best represents the overall scene is then selected by our optimization method based on consistency with observed depth, coverage, and constraints on occupied space. The flexibility of our models is enabled by our approach (Fig. 1) of proposing a large number of likely layout surfaces and objects and then composing a complete scene out of a subset of those proposals while accounting for occlusion, image appearance, depth, and layout consistency.

Figure 1: Overview of our approach. Given one RGBD image, our approach performs complete 3D parsing. We generate layout proposals and object proposals, predict each proposal's support height and class, and retrieve a similar object shape. We then select a subset of layout and object shapes to compose a scene with a complete 3D interpretation.

Detailed 3D labeling. Our approach requires a dataset with labeled 3D shapes for region matching, shape retrieval, and evaluation. We make use of the NYUd v2 dataset [Silberman et al.(2012)Silberman, Hoiem, Kohli, and Fergus], which consists of 1449 indoor RGBD images, each segmented and labeled with object instances and categories. Each segmented object also has a corresponding annotated 3D model, provided by Guo and Hoiem [Guo and Hoiem(2013)]. This 3D labeling provides a groundtruth 3D scene representation with layout surfaces as 3D planar regions, furniture as CAD exemplars, and other objects as coarser polygonal shapes. However, the polygonal shapes are too coarse to enable comparison of object shapes. Therefore, we extend the labeling by Guo and Hoiem with more detailed 3D annotations at the object level. Annotations are produced automatically and adjusted manually, as described in Sec. 3. We evaluate our method on this newly annotated groundtruth, measuring success according to the accuracy of depth prediction for complete layout surfaces, voxel occupancy accuracy, and semantic segmentation performance.

We evaluate our method on our newly annotated NYUd v2 dataset. Experiments on object retrieval demonstrate better region classification and shape estimation compared with the state-of-the-art. Our scene composition achieves better semantic segmentation results and competitive 3D estimation results compared with the state-of-the-art.

Our contributions are:

  1. We refine the NYUd v2 dataset with detailed 3D shape annotations for all the objects in each image. The labelling process is semi-automatic and can be used to label other RGBD scene datasets.

  2. We apply support inference to aid region classification in images.

  3. We use CNNs to classify regions and retrieve objects with similar shapes.

  4. We demonstrate better performance in full 3D scene parsing from single RGBD images compared with the state-of-the-art.

This paper is an extension of our previous work [Guo et al.(2015)Guo, Zou, and Hoiem] (available only on arXiv) that predicts a full 3D scene parse from an RGBD image. Our main new contributions are the refinement of the NYUd v2 dataset with detailed 3D shape annotations, the use of CNNs to classify regions and retrieve object models with shapes similar to a region, and the use of support inference to aid region classification. We also provide more detailed discussion and conduct more extensive experiments, demonstrating qualitative and quantitative improvements.

2 Related work

The work most relevant to ours is Guo et al. [Guo et al.(2015)Guo, Zou, and Hoiem], which predicts complete 3D models of indoor scenes from a single RGBD image, as introduced above. Both our approach and Guo et al. recover complete models from a limited viewpoint, in contrast to the multiview whole-room 3D context interpretation of Zhang et al. [Zhang et al.(2014)Zhang, Song, Tan, and Xiao], which advocates making use of 360° full-view panoramas. We introduce other related topics as follows.

Inferring shape from a single RGB(-D) image. For RGB images, Lim et al. [Lim et al.(2013)Lim, Pirsiavash, and Torralba, Lim et al.(2014)Lim, Khosla, and Torralba] find furniture instances and Aubry et al. [Aubry et al.(2014)Aubry, Maturana, Efros, Russell, and Sivic] recognize chairs using HOG-based part detectors. For RGBD images, Song and Xiao [Song and Xiao(2014)] search for main furniture by sliding 3D shapes and enumerating all possible poses. Similarly, Gupta et al. [Gupta et al.(2015)Gupta, Arbeláez, Girshick, and Malik] fit posed shape models of known classes to improve object detection. Our approach finds an approximate shape for any object and layout surface in the scene. We take an exemplar-based approach, applying region-to-region retrieval to transfer similar 3D shapes from training regions to query regions.

Semantic segmentation for single RGBD images. Silberman et al. [Silberman et al.(2012)Silberman, Hoiem, Kohli, and Fergus] use both image and depth cues to jointly segment objects into categories and infer support relations. Gupta et al. [Gupta et al.(2013)Gupta, Arbelaez, and Malik] apply both generic and class-specific features to assign class labels to regions. Their follow-up work [Gupta et al.(2014)Gupta, Girshick, Arbeláez, and Malik] encodes RGB features together with the HHA depth descriptor (horizontal disparity, height above ground, and angle with gravity) in a CNN for better region classification. Long et al. [Long et al.(2015)Long, Shelhamer, and Darrell] introduce a fully convolutional network structure for learning better features. Our method makes use of the RGB and HHA features to categorize objects into a larger variety of classes, distinguishing infrequent objects in the scene rather than restricting to a parametric model and a few detectable classes. Though semantic segmentation is not the main purpose of our method, we can infer region labels by projecting the 3D scene models onto the 2D image.

Support height estimation. Guo and Hoiem [Guo and Hoiem(2013)] localize the height and full extent of support surfaces from one RGBD image. In addition, object height priors have been shown to be crucial geometric cues for better object detection in both 2D [Hoiem et al.(2008)Hoiem, Efros, and Hebert, Walk et al.(2010)Walk, Schindler, and Schiele] and 3D [Lin et al.(2013)Lin, Fidler, and Urtasun, Song and Xiao(2014), Gupta et al.(2015)Gupta, Arbeláez, Girshick, and Malik]. Deng et al. [Deng et al.(2015)Deng, Todorovic, and Jan Latecki] apply height above ground to distinguish objects. We propose to use an object's support height to aid region class interpretation, which helps classify occluded regions by distinguishing objects that appear at different height levels: e.g., a chair should be on the floor and an alarm clock on a table.

Figure 2: Samples of our semi-automatic 3D object annotations in the NYUd v2 dataset. Rows from top to bottom: input RGBD image, 3D annotations by Guo and Hoiem [Guo and Hoiem(2013)], and our refined 3D annotations. Our annotations capture object shape in much greater detail.

3 Detailed 3D annotations for indoor scenes

We conduct our experiments on the NYUd v2 dataset [Silberman et al.(2012)Silberman, Hoiem, Kohli, and Fergus], which provides complete 3D labeling of both objects and layouts for 1449 RGBD indoor images. Each object and layout has a 2D segment labeling and a corresponding annotated 3D model, provided by Guo and Hoiem [Guo and Hoiem(2013)]. These 3D annotations use 30 models to represent the 6 most common categories of furniture and use extruded polygons to label all other objects. The models provide a good approximation of object extent but are often poor representations of object shape, as shown in Fig. 2. Therefore, we extend the NYUd v2 dataset by replacing the extruded polygons with CAD models collected from ShapeNet [Chang et al.(2015)Chang, Funkhouser, Guibas, Hanrahan, Huang, Li, Savarese, Savva, Song, Su, Xiao, Yi, and Yu] and ModelNet [Wu et al.(2015)Wu, Song, Khosla, Yu, Zhang, Tang, and Xiao]. To align the name-space between datasets, we manually map all model class labels to the object class labels in the NYUd v2 dataset. The shape retrieval and alignment process is performed automatically and then adjusted manually, as follows.

Coarse alignment. For each groundtruth region in the NYUd v2 dataset, we retrieve a model set from our collected CAD models with the same class label as the region. We also include the region's original coarse 3D annotation by Guo and Hoiem [Guo and Hoiem(2013)] in the model set, so that we can preserve the original labeling if none of the CAD models fits the observed depth better. We initialize each model's 3D location at the world-coordinate center of the 3D annotation labeled by Guo and Hoiem and resize the model to the same height as that annotation.
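For concreteness, the following is a minimal sketch of this coarse initialization. It assumes meshes are given as (N, 3) numpy vertex arrays with z as the up axis; the function and argument names are illustrative, not part of any released tooling.

```python
import numpy as np

def coarse_align(model_vertices, annotation_vertices):
    """Place a retrieved CAD model at the annotated object's location and height."""
    model = model_vertices - model_vertices.mean(axis=0)  # center the model at the origin

    # Rescale so the model's height (z extent) matches the coarse 3D annotation's height.
    model_height = model[:, 2].max() - model[:, 2].min()
    target_height = annotation_vertices[:, 2].max() - annotation_vertices[:, 2].min()
    if model_height > 0:
        model = model * (target_height / model_height)

    # Initialize the model's 3D location at the world-coordinate center of the annotation.
    return model + annotation_vertices.mean(axis=0)
```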

Fine alignment. Next, we align each retrieved 3D model to fit the available depth map of the corresponding 2D region in the target scene. The initial alignment is often not at the correct scale or orientation; e.g., a region of a left-facing chair often resembles a right-facing chair and needs to be rotated. We found that using Iterative Closest Point (ICP) to solve for all parameters did not yield good results. Instead, we enumerate 16 equally spaced orientations from -180° to 180° in the top-down view and allow two minor scale adjustments. We perform ICP to solve for translation, initialized with each scale and rotation, and pick the best ICP result based on the following cost function:

E_{fit}(\theta) = \lambda_1 \sum_{p \in \Omega_m \cap \Omega_r} |D(p) - D_\theta(p)| + \lambda_2 \sum_{p \in \Omega_r \setminus \Omega_m} 1 + \lambda_3 \sum_{p \in \Omega_m} \mathbb{1}[D_\theta(p) < D(p)]    (1)

where \theta represents scale, rotation, and translation, \Omega_m is the mask of the rendered aligned object, \Omega_r is the mask of the region proposal, D(p) denotes the observed depth at pixel p, and D_\theta(p) denotes the rendered depth at p. The first term encourages depth similarity to the groundtruth RGBD region; the second penalizes pixels in the proposed region that are not rendered; and the third penalizes pixels in the rendered model that are closer than the observed depth image (so the model does not stick out into space known to be empty).
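For illustration, a minimal sketch of this fitting cost is given below, assuming the aligned model has already been rendered to a depth map; the weight values and the array conventions (NaN for missing or unrendered pixels) are assumptions of the example.

```python
import numpy as np

def fitting_cost(observed_depth, rendered_depth, region_mask,
                 lam1=1.0, lam2=1.0, lam3=1.0):
    """Score one aligned 3D shape against the observed RGBD region (Eq. 1).

    observed_depth : (H, W) sensor depth, NaN where missing.
    rendered_depth : (H, W) depth of the rendered aligned model, NaN where not rendered.
    region_mask    : (H, W) boolean mask of the 2D region proposal.
    """
    rendered_mask = ~np.isnan(rendered_depth)
    observed_mask = ~np.isnan(observed_depth)

    # Term 1: depth similarity where the model is rendered inside the region.
    both = rendered_mask & region_mask & observed_mask
    depth_term = np.abs(observed_depth[both] - rendered_depth[both]).sum()

    # Term 2: region pixels that the rendered model fails to cover.
    missing_term = (region_mask & ~rendered_mask).sum()

    # Term 3: rendered pixels that stick out into space known to be empty
    # (model closer to the camera than the measured depth).
    vis = rendered_mask & observed_mask
    occlusion_term = (rendered_depth[vis] < observed_depth[vis]).sum()

    return lam1 * depth_term + lam2 * missing_term + lam3 * occlusion_term
```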

Based on the fitting cost of Eq. 1, our algorithm picks the model with the best translation, orientation, and scale. We set the term weights \lambda_1, \lambda_2, \lambda_3 based on a grid search on the validation set. For efficiency, we first obtain the top 5 models based on the fitting cost, each scored only over the 16 initial orientations before ICP. For each of these models, we then solve for the best translation for each scale and rotation based on Eq. 1 and finally select the aligned model with the lowest fitting cost.

Figure 3: Cumulative relative depth error of our detailed 3D annotations and of the 3D annotations by Guo and Hoiem [Guo and Hoiem(2013)] on the NYUd v2 dataset.

Post-processing. Automatic fitting may fail due to heavy occlusion or missing depth values. We manually conduct a post-processing check and refine badly fitting models, which affects a fraction of the models. Using a GUI, an annotator checks the automatically produced shape for each region. If the result is not satisfactory, the user compares it to the other top model fits, and if none of those is a good match, the fitting optimization based on Eq. 1 is applied to the original polygonal 3D labeling. This ensures that our detailed shape annotations are a strict improvement over the original coarse annotations.

Validation. Figure 3 reports the cumulative relative error of the rendered depth of our detailed 3D annotations compared with the groundtruth depth in the NYUd v2 dataset. The relative error is computed as:

e_I(p) = |d_I(p) - \hat{d}_I(p)| / d_I(p), \quad p \in I, \; I \in \mathcal{D}    (2)

where \mathcal{D} is the set of all RGBD images in the dataset, p represents a pixel in image I, d_I(p) is the groundtruth sensor depth at pixel p, and \hat{d}_I(p) is the rendered depth of the 3D label annotation at pixel p. For comparison, we report the same error for the 3D annotations by Guo and Hoiem [Guo and Hoiem(2013)]. Our annotations have more points with low relative depth error, and thus model the depth of each image more accurately.
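The evaluation in Fig. 3 can be reproduced along the following lines; this is a minimal sketch assuming depth maps are numpy arrays with NaN for missing sensor readings, and the threshold grid is illustrative.

```python
import numpy as np

def relative_depth_errors(gt_depths, rendered_depths):
    """Collect |d - d_hat| / d (Eq. 2) over all valid pixels of all images."""
    errors = []
    for gt, rend in zip(gt_depths, rendered_depths):
        valid = np.isfinite(gt) & np.isfinite(rend) & (gt > 0)
        errors.append(np.abs(gt[valid] - rend[valid]) / gt[valid])
    return np.concatenate(errors)

def cumulative_curve(errors, thresholds=np.linspace(0.0, 0.5, 51)):
    """Fraction of pixels whose relative error is below each threshold (Fig. 3)."""
    return [(errors <= t).mean() for t in thresholds]
```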

4 Generating object candidates

Given an RGBD image as input, we aim to find a set of layout and object models that fit the RGB and depth observations and provide a likely explanation for the unobserved portion of the scene. To produce object candidate regions, we use the method for RGBD images by Gupta et al. [Gupta et al.(2014)Gupta, Girshick, Arbeláez, and Malik] and extract the top-ranked region proposals for each image. Our experiments show that this region proposal method is more effective than the Prim's-algorithm-based method [Manen et al.(2013)Manen, Guillaumin, and Van Gool] used in our previous work [Guo et al.(2015)Guo, Zou, and Hoiem]. Likely object categories and 3D shapes are then assigned to each candidate region.

4.1 Predicting a region's support height and class

Figure 4: The CNN for predicting a candidate object's support height. We apply ReLU between the convolutional layer and the max pooling layer. Local response normalization is applied before the first fully connected (FC) layer. We add dropout with rate 0.5 before each FC layer during training.

We train CNNs to predict the object category and support height of each region, as shown in Fig. 4 and Fig. 5. The support height is used as a feature for object classification. We also train a CNN with a Siamese design to retrieve the training object with the most similar 3D shape, based on region and depth features.

Support height prediction. We predict the support height of each object with the aim of better predicting its class and position. We first find candidate support heights using the method of Guo and Hoiem [Guo and Hoiem(2013)] and use a CNN to estimate which candidate is most likely for a given object region, based on crops of the depth maps and height maps of the region proposal and of the region that extends from the bottom of the region proposal down to the estimated floor position. The support height network also predicts whether the object is supported from below or from behind. To create the feature vector, we subtract the candidate support height from the height crops, resize all four crops to fixed-size patches, and concatenate them, as illustrated in Fig. 4.
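A minimal sketch of this feature construction is shown below; the crop-box representation, the 32x32 patch size, and the use of skimage for resizing are assumptions of the example (the patch size used in the paper is not reproduced here).

```python
import numpy as np
from skimage.transform import resize  # any image resizing routine would do

def support_height_features(depth, height, proposal_box, floor_box,
                            candidate_support_height, patch_size=(32, 32)):
    """Build the concatenated four-crop input for one candidate support height.

    proposal_box / floor_box: (top, bottom, left, right) crops of the region
    proposal and of the column extending from the proposal down to the floor.
    """
    def crop(arr, box):
        t, b, l, r = box
        return arr[t:b, l:r]

    crops = [
        crop(depth, proposal_box),
        crop(height, proposal_box) - candidate_support_height,  # height relative to candidate
        crop(depth, floor_box),
        crop(height, floor_box) - candidate_support_height,
    ]
    # Resize each crop to a fixed patch (size here is an assumption) and stack
    # them as channels of the support-height CNN input.
    patches = [resize(c, patch_size, preserve_range=True) for c in crops]
    return np.stack(patches, axis=0)
```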

Figure 5: The CNNs for region classification (left) and similar shape retrieval (right). We perform ReLU and dropout with 0.5 after the first FC layer for both of the networks during training.

On the test set, our network identifies the closest candidate support height with high accuracy and a small average distance error. As a feature for classification, we use the support height relative to the camera height, which leads to slightly better performance than using the support height relative to the estimated ground, likely because the dataset images are taken from consistent camera heights while the estimated ground height may be mistaken.

Categorization. Our classification network takes as input the region proposal's support height and support type, along with CNN features from both RGB and depth, and predicts the probability of each class, as shown in Fig. 5. To model the wide variety of shapes in indoor scenes, we classify regions into the most common classes, those with at least 10 training samples. Less common objects are classified as "other prop", "other furniture" and "other structure" based on the rules of Silberman et al. [Silberman et al.(2012)Silberman, Hoiem, Kohli, and Fergus]. In addition, we identify region proposals that are not representative of an object shape (e.g. a chair-leg region when the whole chair is visible) as a "bad region" class, which yields our final multi-way classifier. The input support height and type are directly predicted by our support height prediction network. To create the classification features, we copy the two predicted support values 100 times each (a useful trick to reduce sensitivity to local optima for important low-dimensional features) and concatenate them with the region proposal's RGB and HHA features from Gupta et al. [Gupta et al.(2014)Gupta, Girshick, Arbeláez, and Malik], computed on both the 2D bounding box and the masked region as in [Hariharan et al.(2014)Hariharan, Arbeláez, Girshick, and Malik]. Experiments show that using the predicted support type and support height improves classification accuracy by one to two points (Table 1), with larger improvements for occluded objects.
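A minimal sketch of assembling the classifier input follows, assuming the RGB and HHA features have already been computed for each proposal; the argument names and dimensions are illustrative.

```python
import numpy as np

def classification_features(rgb_box_feat, hha_box_feat, rgb_mask_feat, hha_mask_feat,
                            support_height, support_type):
    """Assemble the input vector for the region classifier."""
    # Tiling the two low-dimensional support values 100x keeps them from being
    # drowned out by the high-dimensional appearance features.
    support = np.concatenate([
        np.repeat(support_height, 100),
        np.repeat(support_type, 100),
    ])
    return np.concatenate([rgb_box_feat, hha_box_feat, rgb_mask_feat, hha_mask_feat, support])
```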

4.2 Predicting a region's 3D shape

Using a Siamese network (Fig. 5), we learn a region-to-region similarity measure that predicts the 3D shape similarity of the corresponding objects. The network embeds the RGB and HHA features used in our classification network into a space where cosine distance correlates with shape similarity, as in [Yih et al.(2011)Yih, Toutanova, Platt, and Meek]. In training, we use the surface-to-surface distance [Rock et al.(2015)Rock, Gupta, Thorsen, Gwak, Shin, and Hoiem] between mesh pairs as the groundtruth similarity and train the network to penalize errors in shape similarity orderings: each region pair's predicted similarity score is compared with the next pair's among the randomly sampled batch in the current epoch and penalized only if the ordering disagrees with the groundtruth similarity. We attempted sharing embedding weights with the classification network but observed a drop in classification performance. We also found the predicted class probability to be unhelpful for predicting shape similarity.
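As a sketch of the training objective, the PyTorch snippet below penalizes a pair of retrievals only when their predicted cosine-similarity ordering disagrees with the groundtruth surface-to-surface similarity; the embedding architecture and layer sizes are illustrative, not the exact network of Fig. 5.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShapeEmbedding(nn.Module):
    """Embed a region's RGB+HHA feature vector for cosine-similarity retrieval."""
    def __init__(self, in_dim, embed_dim=128):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, 512)
        self.fc2 = nn.Linear(512, embed_dim)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

def ordering_loss(embed, query, region_a, region_b, gt_sim_a, gt_sim_b):
    """Penalize a pair of retrievals whose predicted similarity ordering
    disagrees with the groundtruth surface-to-surface similarity."""
    pred_a = F.cosine_similarity(embed(query), embed(region_a), dim=-1)
    pred_b = F.cosine_similarity(embed(query), embed(region_b), dim=-1)
    sign = torch.sign(gt_sim_a - gt_sim_b)  # +1 if region_a should rank higher
    # Zero loss when the predicted ordering agrees with the groundtruth ordering.
    return F.relu(-sign * (pred_a - pred_b)).mean()
```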

Candidate region selection. We apply the above retrieval scheme to each region proposal in each image, obtaining a shape similarity ranking against all training samples as well as object-class and non-object-class probabilities for each region proposal. To reduce the number of retrieved candidates before the scene interpretation in Sec. 5, we first prune the region proposals using non-maximal suppression based on the non-object class probability and a threshold on the non-object class probability. We set the threshold to retain a moderate number of region proposals per image, on average. We then select the two most probable classes for each remaining region proposal and the five most similar shapes for each class, leading to ten shape candidates per region proposal, as sketched below.
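A minimal sketch of this pruning step is given here; the NMS IoU threshold, the non-object threshold, and the shape_ranks data layout are assumptions of the example.

```python
import numpy as np

def mask_iou(a, b):
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 0.0

def select_candidates(masks, class_probs, non_object_probs, shape_ranks,
                      iou_thresh=0.5, non_object_thresh=0.7, n_classes=2, n_shapes=5):
    """Prune proposals, then keep top classes and top shapes per surviving proposal."""
    # Visit proposals from most to least object-like (low non-object probability first).
    order = np.argsort(non_object_probs)
    keep = []
    for i in order:
        if non_object_probs[i] > non_object_thresh:
            continue  # threshold on the non-object probability
        if all(mask_iou(masks[i], masks[j]) < iou_thresh for j in keep):
            keep.append(i)  # non-maximal suppression against already kept proposals

    candidates = {}
    for i in keep:
        top_classes = np.argsort(class_probs[i])[::-1][:n_classes]
        # Two most probable classes x five most similar training shapes per class
        # gives ten shape candidates for the proposal.
        candidates[i] = {int(c): shape_ranks[i][int(c)][:n_shapes] for c in top_classes}
    return candidates
```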

Then, we further refine these ten retrieved shapes. We align each shape candidate to the target scene by translating the model by the offset between the depth point mass centers of the region proposal and the retrieved region. We then perform ICP using a grid of initial values for rotation and scale, and pick the best one for each shape based on the fitting energy in Eq. 1. We set the term weights \lambda_1, \lambda_2, \lambda_3 based on a grid search on the validation set. We tried using the estimated support height of each region proposal to align the related 3D shape models but observed worse performance in the scene composition result, because a relatively small error in an object's support height estimate can cause a large error in fitting.

Finally, we select the two most promising shape candidates based on the following energy function,

E(m, r) = w_1 E_{fit}(m, r) - w_2 P_{cls}(c_m \mid r) + w_3 P_{non}(r)    (3)

where E_{fit} is the fitting energy defined in Eq. 1 that we used for alignment, and P_{cls}(c_m \mid r) and P_{non}(r) are the softmax class probability and the non-object class probability output by our classification network for region proposal r. We normalize the object-class probabilities to sum to 1, so as not to penalize the non-object class twice in the energy function. We set the term weights w_1, w_2, w_3 using a grid search. Note that E_{fit} is on the scale of the number of pixels in the region.
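A minimal sketch of ranking the aligned candidates with an energy of this form is shown below; since the exact combination and weights follow the reconstruction above, treat them as illustrative rather than the precise formulation.

```python
def selection_energy(fit_cost, class_prob, non_object_prob,
                     w_fit=1.0, w_cls=1.0, w_non=1.0):
    """Lower is better: good depth fit, confident class, unlikely non-object."""
    return w_fit * fit_cost - w_cls * class_prob + w_non * non_object_prob

def top_candidates(candidates, k=2):
    """candidates: list of dicts with 'fit_cost', 'class_prob', 'non_object_prob'."""
    ranked = sorted(candidates, key=lambda c: selection_energy(
        c['fit_cost'], c['class_prob'], c['non_object_prob']))
    return ranked[:k]
```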

4.3 Training details

We first find meta-parameters on the validation set after training classifiers on the training set. We then retrain on both the training and validation sets in order to report results on the test set. We train our networks with the region proposals that have the highest (and at least 0.5) 2D intersection-over-union (IoU) with each groundtruth region in the training set. We train the support height prediction network with the groundtruth support type of the region proposal and set the groundtruth support height to the closest support height candidate near the related 3D annotation's bottom. For training regions that are supported from behind, we do not penalize the support height estimate, since our support height candidates are for vertical support. For training the classification network, we also include non-object region proposals that have low IoU with the groundtruth regions. We randomly sample the same number of non-object regions as the total number of object regions during training. To avoid unbalanced weights for different classes, we sample the same number of training regions for each class from the dataset in each epoch. When training the shape similarity network, we translate each 3D model to the origin and resize it to a fixed-size voxel cuboid before computing the surface-to-surface distance. We use ADAM [Kingma and Ba(2014)] to train each network, with the learning rate chosen separately for the support height prediction, classification, and Siamese networks.
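The class-balanced sampling can be sketched as follows; the per-class sample count and the oversampling of rare classes with replacement are assumptions of this example.

```python
import random
from collections import defaultdict

def balanced_epoch(regions, labels, per_class=100):
    """Sample the same number of training regions for every class in an epoch."""
    by_class = defaultdict(list)
    for region, label in zip(regions, labels):
        by_class[label].append(region)

    epoch = []
    for label, items in by_class.items():
        if len(items) >= per_class:
            epoch += random.sample(items, per_class)     # sample without replacement
        else:
            epoch += random.choices(items, k=per_class)  # oversample rare classes
    random.shuffle(epoch)
    return epoch
```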

5 Scene composition with candidate 3D shapes

Given the set of retrieved candidate 3D shapes and the layout candidates from Guo et al. [Guo et al.(2015)Guo, Zou, and Hoiem], we select a subset of candidates that closely reproduces the original depth image when rendered and corresponds to minimally overlapping aligned 3D models. Composition is difficult because of the high degree of occlusion and the large number of objects in the scene. We apply the method proposed by Guo et al. [Guo et al.(2015)Guo, Zou, and Hoiem] to perform the scene composition, which yields our final scene interpretation.

6 Experiments

6.1 Experimental setting

We evaluate our method on our detailed 3D-annotated NYUd v2 dataset, and we also report the result of the state-of-the-art method of Guo et al. [Guo et al.(2015)Guo, Zou, and Hoiem] on our new annotations. As in Gupta et al. [Gupta et al.(2014)Gupta, Girshick, Arbeláez, and Malik], we tune parameters on the validation set while training on the training set, and we report results on the test set after training on both the training and validation sets.

Method | avg per class | avg precision | avg over instance | picture | chair | cabinet | pillow | bottle | books | paper | table | box | window | door | sofa | bag | lamp | clothes
w/o support height | 43.7 ± 0.3 | 37.7 ± 0.1 | 40.8 ± 0.3 | 57.5 | 46.1 | 39.5 | 66.5 | 55.6 | 30.0 | 40.7 | 36.7 | 10.6 | 61.4 | 54.9 | 63.0 | 14.9 | 64.6 | 25.3
w/ support height | 44.7 ± 0.3 | 39.7 ± 0.1 | 42.7 ± 0.2 | 57.5 | 53.1 | 44.35 | 69.0 | 54.9 | 33.5 | 43.5 | 39.5 | 14.1 | 62.4 | 57.8 | 65.9 | 15.1 | 65.9 | 26.9
Table 1: Our 81-class classification accuracy on groundtruth 2D regions in the test set.

We compare our classification network with and without support height estimation. We report the average accuracy per class, the average precision based on the predicted probability, and the accuracy averaged over instances. The classification networks are trained and evaluated 10 times, and the means and standard deviations (reflecting variation due to randomness in learning) are reported. Results for 15 common object classes are also listed. Bold numbers signify better performance.

6.2 Evaluation of region classification and shape retrieval

Occlusion Ratio | w/ support height | w/o support height
Table 2: Classification accuracy under different occlusion ratios of groundtruth regions in test set

Classification. We report the region classification accuracy of our classification network on the groundtruth 2D regions in the test set, as shown in Table 1. Overall, including the support height prediction improves the classification results. Classes whose objects usually appear at floor level (e.g. chair, desk) benefit from the object's height, while classes that appear at several heights (e.g. picture) do not.

Table 2 shows the per-instance classification accuracy under different occlusion ratios of the groundtruth regions in the test set. The improvement in classification accuracy is larger for highly occluded regions, which supports our claim that estimating an object's support height helps classify occluded regions.

Retrieval. We evaluate our candidate shape retrieval method against the state-of-the-art method of Guo et al. [Guo et al.(2015)Guo, Zou, and Hoiem] given groundtruth 2D regions, as shown in Table 3. Since we use groundtruth regions, we do not perform the candidate selection of Sec. 4.2 for this evaluation. We evaluate 1) the accuracy of the top retrieved classes and 2) the similarity of the top retrieved shapes, measured by shape intersection-over-union (IoU) and surface-to-surface distance, for the top 1, 2, and 3 retrievals. To avoid rotation ambiguity in the shape similarity measurement, we rotate each retrieved object to find the best similarity score with the groundtruth 3D shape. Our retrieval method outperforms the state-of-the-art under all evaluation criteria.
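A minimal sketch of this rotation-robust shape IoU: voxelize both shapes and keep the best IoU over a grid of top-down rotations. The voxel resolution and the 16-angle grid are illustrative choices for the example.

```python
import numpy as np

def voxelize(points, res=32):
    """Scale a point cloud into the unit cube and mark occupied voxels."""
    p = points - points.min(axis=0)
    extent = p.max()
    if extent > 0:
        p = p / extent
    idx = np.clip((p * (res - 1)).astype(int), 0, res - 1)
    grid = np.zeros((res, res, res), dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

def voxel_iou(a, b):
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 0.0

def rotate_z(points, angle):
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ rot.T

def best_iou_over_rotation(retrieved_pts, gt_pts, n_rot=16, res=32):
    """Best voxel IoU of the retrieved shape over a grid of top-down rotations."""
    gt_vox = voxelize(gt_pts, res)
    angles = np.linspace(0.0, 2.0 * np.pi, n_rot, endpoint=False)
    return max(voxel_iou(voxelize(rotate_z(retrieved_pts, a), res), gt_vox)
               for a in angles)
```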

Method | avg class accuracy (%): top 1 / top 2 / top 3 | avg 3D IoU: top 1 / top 2 / top 3 | avg surface distance (m): top 1 / top 2 / top 3
Guo et al. [Guo et al.(2015)Guo, Zou, and Hoiem] | | |
Ours | 41.56 / 54.47 / 62.07 | 0.191 / 0.231 / 0.249 | 0.026 / 0.024 / 0.023

Table 3: Quantitative evaluation of our retrieval method compared with the method of Guo et al.

6.3 Evaluation of scene composition

84-class semantic segmentation. We evaluate 84-class semantic segmentation (81 object classes and 3 layout classes: wall, ceiling, floor) on the 2D image rendered from our scene composition result. We compare our method with Guo et al. [Guo et al.(2015)Guo, Zou, and Hoiem] using both automatically generated region proposals and groundtruth regions. Table 4 shows the average class accuracy (avacc), the average class accuracy weighted by frequency (fwavacc), and the average pixel accuracy (pixacc). Our method achieves better semantic segmentation results. We report our results with and without support height prediction, as well as the result given the groundtruth support height as an upper bound. We see improvements over Guo et al. due to both the classification method (CNN vs. CCA) and the region proposal method (MCG by Gupta et al. vs. the Prim's-based proposals used by Guo et al.). We also compare with the state-of-the-art semantic segmentation of Long et al. [Long et al.(2015)Long, Shelhamer, and Darrell] in terms of pixacc/avacc/fwavacc. Note that our method focuses on predicting 3D geometry rather than semantic segmentation.

(a) Overall results
(b) Each class’s w/ support accuracy minus w/o support accuracy with automatic region proposals
Table 4: Results of 84-class semantic segmentation on both automatic region proposals and groundtruth regions.

84-class instance segmentation. We report instance segmentation results with the same experimental setting as for semantic segmentation. The evaluation follows the protocol of RMRC [Urtasun et al.(2013)Urtasun, Fergus, Hoiem, Torralba, Geiger, Lenz, Silberman, Xiao, and Fidler]. Our method slightly outperforms the state-of-the-art. Note that the lower unweighted mean coverage (MeanCovU) is caused by the large number of infrequent classes.

Method | Mean coverage (weighted) | Mean coverage (unweighted)
Guo et al. [Guo et al.(2015)Guo, Zou, and Hoiem] | |
MCG+CCA | |
Ours w/o support | 50.43 | 34.34
Ours | |
Ours w/ GT support | |
Ours w/ GT-region | 52.39 |
Detailed groundtruth labeling | |
Table 5: Results of 84-class instance segmentation on both automatic region proposals and groundtruth regions.

3D estimation. We evaluate 3D estimation as in Guo et al. [Guo et al.(2015)Guo, Zou, and Hoiem] with layout pixel and depth error as well as freespace and occupancy estimation. As shown in Table 6, our method achieves competitive 3D estimation results compared with the state-of-the-art.

Method | Layout Pixel Error: overall / visible / occluded | Layout Depth Error: overall / visible / occluded | Freespace: precision / recall | Occupancy: precision / recall / precision- / recall-
Guo et al. [Guo et al.(2015)Guo, Zou, and Hoiem] 0.504 0.751
Ours 10.6 13.6 0.15 0.181 0.919 0.397 0.710
Table 6: Results of 3D layout, freespace and occupancy estimation.

Qualitative results. Sample qualitative results are shown in Fig. 6 and Fig. 7. Our method achieves better region classification and shape estimation, which results in slightly better scene compositions. Failure cases are caused by bad pruning of region proposals (last two columns, Fig. 6) and confusion between similar classes (top right, Fig. 7).

Figure 6: Qualitative results of scene composition with automatic region proposals. We randomly sample images from the top 25% (first four rows), middle 50% (rows 6-9), and worst 25% (last four rows) based on 84-class semantic segmentation accuracy.
Figure 7: Qualitative results of scene composition given groundtruth 2D labeling as region proposals. We randomly sample images from the top 25% (first two rows), middle 50% (rows 3-4), and worst 25% (last two rows) based on 84-class semantic segmentation accuracy.

7 Conclusions

In this paper, we predict the full 3D parse of both visible and occluded portions of a scene from one RGBD image. We propose to use support inference to aid interpretation and a retrieval scheme that uses CNNs to classify regions and find objects with similar shapes. Experiments demonstrate that our method performs better in semantic and instance segmentation and achieves competitive results in 3D scene estimation.

Acknowledgements

This research is supported in part by ONR MURI grant N000141010934 and ONR MURI grant N000141612007. We thank David Forsyth for insightful comments and discussion.

References

  • [Aubry et al.(2014)Aubry, Maturana, Efros, Russell, and Sivic] M. Aubry, D. Maturana, A. A. Efros, B. C. Russell, and J. Sivic. Seeing 3d chairs: Exemplar part-based 2d-3d alignment using a large dataset of cad models. In CVPR, 2014.
  • [Chang et al.(2015)Chang, Funkhouser, Guibas, Hanrahan, Huang, Li, Savarese, Savva, Song, Su, Xiao, Yi, and Yu] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago, 2015.
  • [Deng et al.(2015)Deng, Todorovic, and Jan Latecki] Zhuo Deng, Sinisa Todorovic, and Longin Jan Latecki. Semantic segmentation of rgbd images with mutex constraints. In Proceedings of the IEEE International Conference on Computer Vision, pages 1733–1741, 2015.
  • [Guo and Hoiem(2013)] Ruiqi Guo and Derek Hoiem. Support surface prediction in indoor scenes. In ICCV, 2013.
  • [Guo et al.(2015)Guo, Zou, and Hoiem] Ruiqi Guo, Chuhang Zou, and Derek Hoiem. Predicting complete 3d models of indoor scenes. arXiv preprint arXiv:1504.02437, 2015.
  • [Gupta et al.(2013)Gupta, Arbelaez, and Malik] Saurabh Gupta, Pablo Arbelaez, and Jitendra Malik. Perceptual organization and recognition of indoor scenes from RGB-D images. In CVPR, 2013.
  • [Gupta et al.(2014)Gupta, Girshick, Arbeláez, and Malik] Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik. Learning rich features from rgb-d images for object detection and segmentation. In European Conference on Computer Vision, pages 345–360. Springer, 2014.
  • [Gupta et al.(2015)Gupta, Arbeláez, Girshick, and Malik] Saurabh Gupta, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Aligning 3d models to rgb-d images of cluttered scenes. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    , pages 4731–4740, 2015.
  • [Hariharan et al.(2014)Hariharan, Arbeláez, Girshick, and Malik] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Simultaneous detection and segmentation. In Computer vision–ECCV 2014, pages 297–312. Springer, 2014.
  • [Hoiem et al.(2008)Hoiem, Efros, and Hebert] Derek Hoiem, Alexei A Efros, and Martial Hebert. Putting objects in perspective. International Journal of Computer Vision, 80(1):3–15, 2008.
  • [Kingma and Ba(2014)] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [Lim et al.(2013)Lim, Pirsiavash, and Torralba] Joseph J Lim, Hamed Pirsiavash, and Antonio Torralba. Parsing IKEA objects: Fine pose estimation. In ICCV, 2013.
  • [Lim et al.(2014)Lim, Khosla, and Torralba] Joseph J Lim, Aditya Khosla, and Antonio Torralba. Fpm: Fine pose parts-based model with 3d cad models. In European Conference on Computer Vision, pages 478–493. Springer, 2014.
  • [Lin et al.(2013)Lin, Fidler, and Urtasun] Dahua Lin, Sanja Fidler, and Raquel Urtasun. Holistic scene understanding for 3d object detection with rgbd cameras. In ICCV, 2013.
  • [Long et al.(2015)Long, Shelhamer, and Darrell] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
  • [Manen et al.(2013)Manen, Guillaumin, and Van Gool] Santiago Manen, Matthieu Guillaumin, and Luc Van Gool. Prime object proposals with randomized prim’s algorithm. In Proceedings of the IEEE International Conference on Computer Vision, pages 2536–2543, 2013.
  • [Rock et al.(2015)Rock, Gupta, Thorsen, Gwak, Shin, and Hoiem] Jason Rock, Tanmay Gupta, Justin Thorsen, JunYoung Gwak, Daeyun Shin, and Derek Hoiem. Completing 3d object shape from one depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2484–2493, 2015.
  • [Silberman et al.(2012)Silberman, Hoiem, Kohli, and Fergus] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. Computer Vision–ECCV 2012, pages 746–760, 2012.
  • [Song and Xiao(2014)] Shuran Song and Jianxiong Xiao. Sliding shapes for 3d object detection in depth images. In European conference on computer vision, pages 634–651. Springer, 2014.
  • [Urtasun et al.(2013)Urtasun, Fergus, Hoiem, Torralba, Geiger, Lenz, Silberman, Xiao, and Fidler] R Urtasun, R Fergus, D Hoiem, A Torralba, A Geiger, P Lenz, N Silberman, J Xiao, and S Fidler. Reconstruction meets recognition challenge, 2013.
  • [Walk et al.(2010)Walk, Schindler, and Schiele] Stefan Walk, Konrad Schindler, and Bernt Schiele. Disparity statistics for pedestrian detection: Combining appearance, motion and stereo. In Computer Vision–ECCV 2010, pages 182–195. Springer, 2010.
  • [Wu et al.(2015)Wu, Song, Khosla, Yu, Zhang, Tang, and Xiao] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
  • [Yih et al.(2011)Yih, Toutanova, Platt, and Meek] Wen-tau Yih, Kristina Toutanova, John C Platt, and Christopher Meek. Learning discriminative projections for text similarity measures. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 247–256. Association for Computational Linguistics, 2011.
  • [Zhang et al.(2014)Zhang, Song, Tan, and Xiao] Yinda Zhang, Shuran Song, Ping Tan, and Jianxiong Xiao. Panocontext: A whole-room 3d context model for panoramic scene understanding. In European Conference on Computer Vision, pages 668–686. Springer, 2014.