
PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds

3D scene understanding from point clouds plays a vital role in various robotic applications. Unfortunately, current state-of-the-art methods use separate neural networks for different tasks such as object detection and room layout estimation. Such a scheme has two limitations: 1) Storing and running several networks for different tasks is expensive for typical robotic platforms. 2) The intrinsic structure of the separate outputs is ignored and potentially violated. To this end, we propose the first transformer architecture that predicts 3D objects and layouts simultaneously, using point cloud inputs. Unlike existing methods that estimate either layout keypoints or edges, we directly parameterize the room layout as a set of quads. As such, the proposed architecture is termed P(oint)Q(uad)-Transformer. Along with the novel quad representation, we propose a tailored physical constraint loss function that discourages object-layout interference. Quantitative and qualitative evaluations on the public benchmark ScanNet show that the proposed PQ-Transformer succeeds in jointly parsing 3D objects and layouts, running at a quasi-real-time (8.91 FPS) rate without efficiency-oriented optimization. Moreover, the new physical constraint loss can improve strong baselines, and the F1-score of the room layout is significantly promoted from 37.9% to 57.9%.



I Introduction

Recent years have witnessed the emergence of 3D scene understanding technologies, which enable robots to understand the geometric, semantic and cognitive properties of real-world scenes, so as to assist robot decision making. However, 3D scene understanding remains challenging due to the following problems: 1) Holistic understanding requires many sub-problems to be addressed, such as semantic label assignment [1], object bounding box localization [3] and room structure boundary extraction [2]. Current methods solve these tasks with separate models, which is expensive in terms of storage and computation. 2) Physical commonsense [5] constraints between different tasks, such as gravity [6] or interference [7], are ignored and potentially violated, producing geometrically implausible results.

Fig. 1: Illustration of PQ-Transformer on a representative scene of ScanNet. (a) Input point cloud; the RGB values are used only for visualization and are not network inputs. (b) Ground truth. (c) Our prediction. Comparing (b) and (c), the proposed PQ-Transformer succeeds in jointly detecting 3D objects (green) and estimating room layouts (blue) in an end-to-end fashion, with high accuracy.

Aiming for robust 3D scene understanding, we propose PQ-Transformer, the first algorithm that jointly predicts 3D object bounding boxes and 3D room layouts in one forward pass. As illustrated in Fig. 1(a), the input is a 3D point cloud of a scene reconstructed by SDF-based fusion [8][9]. Note that the RGB values are used only for visualization and are not network inputs. PQ-Transformer predicts a set of 3D object boxes with semantic category labels and another set of quadrilateral (denoted as quad) equations representing structural elements (wall, floor and ceiling). Although these quads have zero width by nature, we set their widths to a small value for better visualization, as illustrated by the flat blue boxes in Fig. 1(c). As the comparison with the ground truth in Fig. 1(b) shows, PQ-Transformer addresses both tasks with high accuracy. Such a joint prediction of 3D objects and layouts is favorable for many robotics applications, since it can largely reduce the overhead of both storage and inference.

Furthermore, we propose a new loss function that introduces the physical constraints between the two tasks during training. This loss function originates from a natural supervision signal rather than from human-annotated supervision. Specifically, the interference between object boxes and layout quads is penalized. On one hand, this is consistent with human commonsense, and incorporating the constraint makes the learning system closer to the human cognitive system. On the other hand, since trivial mistakes like objects sinking into the ground are corrected, it is natural to expect a more accurate result. Regarding the design of the neural network architecture, a transformer is specifically tailored for the joint prediction task. Using two backbones is computationally expensive, while using two linear heads leads to contradictory usage of queries. As such, two sets of proposal queries are generated separately for the two tasks, striking a balance between efficiency and accuracy.

Benefiting from the new representation and network, PQ-Transformer achieves superior performance on challenging scenes of the public benchmark ScanNet. It succeeds in jointly parsing 3D objects and layouts, running at a quasi-real-time (8.91 FPS) rate without efficiency-oriented optimization. Moreover, the proposed physical constraint loss improves strong baselines, and the F1-score of the room layout is significantly promoted from 37.9% to 57.9%. As demonstrated in Fig. 5, the results are useful for both researchers studying 3D scene understanding and practitioners building robotics systems. The technical contributions are summarized as follows.

  • PQ-Transformer is the first neural network architecture that jointly predicts 3D objects and layouts from point clouds in an end-to-end fashion.

  • Unlike former room layout estimation methods that predict features for layout keypoints, edges or facets, we introduce the quad representation and successfully exploit it for discriminative learning.

  • We propose a new physical constraint loss that is tailored for the proposed quad representation, which is principled, generic, efficient and empirically successful.

  • PQ-Transformer achieves competitive performance on the ScanNet benchmark. Notably, a new SOTA is achieved for layout F1-score (from 37.9% to 57.9%).

II Related Works

Many 3D scene understanding tasks were initially defined in the image-based setting, before the advent of commodity RGB-D cameras. [10] proposes a successful statistical feature called geometric context and parses scenes into geometric categories based upon it, which lays the foundation for pre-deep-learning data-driven 3D scene understanding. Early works [11][12] group line segments according to Manhattan vanishing points and propose primitives for later reasoning, demonstrating the capability of estimating room layouts. [13] shows that proposing 3D objects aligned to the Manhattan frame can be used for joint 3D detection and layout estimation. Later, Bayesian models [14][15] are introduced into the field, which model object-layout relationships as statistical priors in a principled manner. In the last decade, many sub-problems have benefited from the strong representation power of deep neural networks, including but not limited to object detection [16][17], object reconstruction [18][19] and room layout estimation [20][21][22][23]. Recently, joint 3D understanding of several sub-tasks has seen exciting progress, like COPR [24], Holistic++ [25] and Total3D [26].

After the advent of commodity RGB-D sensors like Kinect or RealSense, 3D scene understanding with point cloud inputs has gradually gained popularity. Since the depth information is readily known, scale ambiguity no longer exists. Yet robust understanding is still challenged by issues like occlusion, expensive annotations and sensor noise. SlidingShapes [27] exploits viewpoint-aware exemplar SVMs for 3D detection. DeepSlidingShapes [28] designs a sophisticated 3D proposal network with data-driven anchors. Semantic scene completion [29][30] jointly completes scenes and assigns 3D semantic labels, taking a single depth image as the input. Point-wise semantic labelling is successfully addressed by recently proposed architectures like SparseConv [31][1] or PointNet [32]. In light of the prior work above, PQ-Transformer is, to our knowledge, the first transformer-based architecture that jointly predicts 3D objects and layouts from point clouds, with a new quad representation for layouts and its corresponding physical constraint loss.

Fig. 2: Overview of PQ-Transformer. Given an input 3D point cloud of N points, the point cloud feature learning backbone extracts M context-aware point features of (3+C) dimensions, through sampling and grouping. A voting module and a farthest point sampling (FPS) module are used to generate object proposals and quad proposals respectively. Then the proposals are processed by a transformer decoder to further refine proposal features. Through several feedforward layers and non-maximum suppression (NMS), the proposals become the final object bounding boxes and layout quads.

III Method

Our goal is to jointly parse common objects (semantic labels and 3D bounding boxes) and 3D room layouts with a single neural network. To this end, we propose an end-to-end attention-based architecture named PQ-Transformer. We illustrate our architecture in detail with Fig.2.

In the remainder of this section, we first introduce a new representation for 3D room layout, then describe the detailed network architecture. After that, we propose a novel physical constraint loss to refine the joint detection results, by discouraging the interference between objects and layout quads. Finally, we discuss the loss function terms to train PQ-Transformer in an end-to-end fashion.

III-A Representation: Layout as Quads

The representation for 3D object detection is mature and clear. Following prior work [3][33], we use the center coordinate, orientation and size to describe an object bounding box. However, the representation of room layout is still an open problem. Total3D [26] describes the whole room with a single 3D bounding box, just like an object. However, this representation might not work well because the layout of a real-world room is often non-rectangular, so a single 3D box is not enough to describe it accurately. Like image-based layout estimation methods, SceneCAD [2] uses layout vertices and edges as the representation. This representation is not compact and requires further fitting to obtain parametric results.

Different from former methods, we represent the room layout as a set of quads, which is parametric and compact. Since floors and ceilings are not always rectangular, we only use quads to represent the walls of a room. The parametric ceiling and floor can then be represented by the upper and lower boundaries of the walls. In this way, we formulate the room layout estimation problem as quad detection. The detailed mathematical definition can be found in Section III-C.
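As a rough illustration of this parameterization, the sketch below stores a wall quad as a center, a 2D size and a normal, and derives its plane equation and four corners. The class name `WallQuad`, the field names, and the assumption that walls are vertical with the z-axis pointing up are illustrative choices, not part of the original formulation.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class WallQuad:
    """Hypothetical container for one wall quad (vertical wall, z-axis up)."""
    center: np.ndarray  # (3,) quad center
    size: tuple         # (width, height) of the quad
    normal: np.ndarray  # (3,) horizontal normal, pointing into the room

    def plane(self):
        """Return (a, b, c, d) of the plane a*x + b*y + c*z + d = 0."""
        n = self.normal / np.linalg.norm(self.normal)
        d = -float(n @ self.center)
        return float(n[0]), float(n[1]), float(n[2]), d

    def corners(self):
        """Return the four quad corners as a (4, 3) array."""
        n = self.normal / np.linalg.norm(self.normal)
        up = np.array([0.0, 0.0, 1.0])
        side = np.cross(up, n)  # horizontal direction lying in the wall plane
        side /= np.linalg.norm(side)
        w, h = self.size
        signs = [(-1, -1), (1, -1), (1, 1), (-1, 1)]
        return np.array([
            self.center + sx * 0.5 * w * side + sz * 0.5 * h * up
            for sx, sz in signs
        ])
```

With this layout, a floor or ceiling polygon could be assembled from the lower or upper corners of all wall quads, in the spirit of the paragraph above.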

III-B Network Architecture

The overall network architecture is depicted in Fig.2. It is composed of four main parts: 1) Backbone: a backbone to extract features from point clouds; 2) Proposal modules: two proposal modules to generate possible objects and layout quads respectively; 3) Transformer decoder: a transformer decoder to process proposal features with context-aware point cloud features; 4) Prediction heads: two prediction heads with several feed forward layers to produce the final predictions, in the joint object-layout output space.

Backbone. We implement the point cloud feature learning backbone with PointNet++ modules. First, four set-abstraction layers down-sample the input 3D point cloud and aggregate local point features. Then two feature propagation layers up-sample the points and generate $M$ points with features of dimension $C$. Concatenated with the 3D coordinates, the extracted $(3+C)$-dimensional features are the context-aware local features of the entire scene. They are used as the input to the following proposal modules and as the key of the cross-attention layers in the transformer decoder.

Proposal modules. We use a voting module and a farthest point sampling (FPS) module to generate proposals for objects and layout quads, respectively.

Voting. The idea of voting comes from VoteNet [3], a technique based on Hough voting. Every point in a bounding box is associated with a voting vector towards the box center. To generate votes, we apply a weight-shared multi-layer perceptron (MLP) to the backbone features. The $i$-th point is represented by $s_i = [x_i; f_i]$, with $x_i \in \mathbb{R}^3$ as its 3D point coordinate and $f_i \in \mathbb{R}^C$ as its $C$-dimensional feature. The output of this MLP is the offsets of coordinate and feature, $[\Delta x_i; \Delta f_i]$. We get its vote $v_i = [y_i; g_i]$, where $y_i = x_i + \Delta x_i$ and $g_i = f_i + \Delta f_i$. We then sample a subset of votes by using an FPS module on the values of $y_i$. Each cluster is a 3D object proposal.
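The voting step can be sketched as a shared MLP that predicts residual offsets for both coordinates and features. The function name `generate_votes`, the two-layer MLP without biases, and the weight shapes are assumptions made for illustration; the actual network depth and normalization are not specified here.

```python
import numpy as np


def generate_votes(xyz, feats, w1, w2):
    """Sketch of vote generation with a weight-shared two-layer MLP.

    xyz:   (M, 3) seed coordinates
    feats: (M, C) seed features
    w1:    (3 + C, H) first-layer weights (hypothetical shape)
    w2:    (H, 3 + C) second-layer weights (hypothetical shape)
    """
    seeds = np.concatenate([xyz, feats], axis=1)  # (M, 3 + C)
    hidden = np.maximum(seeds @ w1, 0.0)          # shared layer + ReLU
    offsets = hidden @ w2                         # (M, 3 + C)
    vote_xyz = xyz + offsets[:, :3]               # coordinate plus its offset
    vote_feats = feats + offsets[:, 3:]           # feature plus its offset
    return vote_xyz, vote_feats
```

Because the same weights are applied to every seed, each point votes independently; a subset of the resulting vote coordinates would then be sampled to form object proposals.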

Farthest Point Sampling. We use FPS to generate initial proposals for layout quads. FPS is based on the idea of repeatedly placing the next sample point in the middle of the least-known area of the sampling domain. FPS starts with a randomly sampled point as the first proposal candidate, and iteratively selects the point farthest from the already selected points until the required number of candidates is selected. Though simple, FPS works well for our layout quad detection formulation. The walls are usually distributed on the outer boundaries of the room and are far from each other, so there is a high probability that FPS selects points on the walls that provide good enough proposals for quad detection.
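The sampling procedure described above can be sketched in a few lines of NumPy. This is a straightforward O(N·k) version written for clarity; the function name and the `seed` argument are illustrative and not taken from the original implementation.

```python
import numpy as np


def farthest_point_sampling(points, k, seed=0):
    """Return the indices of k points chosen by farthest point sampling.

    points: (N, 3) array of coordinates, with k <= N.
    """
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    selected = np.empty(k, dtype=np.int64)
    selected[0] = rng.integers(n)  # random first candidate
    # Distance from every point to its nearest already-selected point.
    min_dist = np.linalg.norm(points - points[selected[0]], axis=1)
    for i in range(1, k):
        selected[i] = int(np.argmax(min_dist))  # farthest from the selected set
        new_dist = np.linalg.norm(points - points[selected[i]], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected
```

Since each new sample maximizes its distance to the set chosen so far, samples tend to spread towards the extremities of the point cloud, which is where walls usually lie.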

Transformer decoder. After generating initial proposals based on voting and FPS, we use a transformer decoder to further refine the proposal features. The three basic elements of an attention module are the query ($Q$), key ($K$) and value ($V$), which all share the same dimension in our case. First, we feed the proposal features through a self-attention layer, in which the query, key and value are all computed from the proposal features. The self-attention layer exploits the mutual relationship between all object and layout quad proposals.

Fig. 3: The conceptual illustration of physical constraint loss. The left picture is the detection result without physical constraint loss. As shown in the right picture, after adding physical constraint loss, the bounding box that intersects the wall will move inward. As such the detection result will become more accurate and consistent with physical facts.

In addition, we use the context-aware point cloud features produced by the backbone as the key, and fuse them with the proposal features through cross-attention layers. The learned mappings in these attention layers are fully connected layers with batch normalization and ReLU. Our transformer decoder has six blocks, each consisting of a self-attention layer and a cross-attention layer. The six blocks generate six sets of detection results, respectively. The detection results of the previous block are used as the position encoding for the current block.
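To make the data flow of one decoder block concrete, the following sketch applies standard scaled dot-product attention twice: once among the proposals, and once from the proposals to the point features. It is a simplified single-head version without learned projections, normalization or position encoding, and it assumes the point features also act as the value in cross-attention; these simplifications are ours.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention: q (Nq, d), k (Nk, d), v (Nk, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def decoder_block(proposals, point_feats):
    """One simplified block: self-attention, then cross-attention.

    proposals:   (P, d) object and quad proposal features
    point_feats: (M, d) context-aware point features from the backbone
    """
    # Self-attention: proposals attend to each other (residual connection).
    x = proposals + attention(proposals, proposals, proposals)
    # Cross-attention: proposals query the point features (assumed key and value).
    x = x + attention(x, point_feats, point_feats)
    return x
```

Stacking six such blocks and attaching prediction heads to each output would give the six intermediate sets of detection results mentioned above.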

Prediction heads. After feeding the proposals through the transformer decoder, we use two sets of MLPs as two prediction heads to generate the final results. One is used to classify objects and regress object bounding boxes, while the other is used to regress layout quads. For object detection, we follow the formulation of VoteNet [3], using a vector of size $2 + 3 + 2N_H + 4N_S + N_C$, which consists of 2 objectness scores, 3 center regression values, $N_H$ heading bins, $N_H$ heading regression values, $N_S$ size bins, $3N_S$ size regression values for height-width-depth, and $N_C$ semantic categories. For layout quad detection, we use a vector of size 10, which is composed of 2 quadness scores, 3 center regression values, 2 size regression values and 3 normal vector components. Both 3D objects and 2D quads are processed with 3D NMS to remove duplicate boxes, because we give each quad a fixed (but small) width to form a flat cuboid.
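As a small worked example of the output dimensions, the helper below computes the length of the object head vector under the VoteNet-style layout (objectness, center, heading bins and residuals, size bins and residuals, classes) and of the quad head vector. The function names are hypothetical, the per-field split of the quad vector is one breakdown consistent with the stated size of 10, and the bin and class counts used in the usage line are example values rather than settings confirmed by the text.

```python
def object_head_size(num_heading_bins, num_size_bins, num_classes):
    """Length of the per-proposal object output vector (VoteNet-style layout)."""
    objectness = 2
    center = 3
    heading = 2 * num_heading_bins  # bin scores + one residual per bin
    size = 4 * num_size_bins        # bin scores + 3 residuals per bin
    return objectness + center + heading + size + num_classes


def quad_head_size():
    """Length of the per-proposal quad output vector."""
    quadness, center, size, normal = 2, 3, 2, 3
    return quadness + center + size + normal


# Example with assumed counts: 1 heading bin, 18 size bins, 18 classes.
example_object_size = object_head_size(1, 18, 18)
```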

III-C Physical Constraint

Fig. 4: The mathematical illustration of physical constraints. We use the vertices of the object bounding box (green) to judge whether it intersects the wall (blue). The plane equation of the wall divides the space into two parts. B2 is outside the room and B1 is inside the room.

So far, the object and layout outputs of PQ-Transformer can take unrealistic values, yet there are physical constraints between them in real-world rooms. For example, a table might be close to a wall, but it can never overlap with the wall. In addition, the bottom of the table cannot be lower than the floor. Based on this fact, we design a novel physical constraint loss tailored for our quad representation of layouts, in order to discourage interference. It helps the network generate more precise and reasonable results. It is noteworthy that although there are physical constraints between most objects and walls, some types of objects do overlap with the walls, such as doors, windows and curtains. Therefore, our physical constraints are only designed for those types of objects that never overlap with walls. We use the set $\mathcal{C}_{pc}$ to represent the corresponding object categories. Fig. 3 illustrates the role of physical constraints.

Fig. 5: Qualitative prediction results on ScanNet. Objects are outlined in green while layout quads are outlined in blue.

We use a quad in the 3D Euclidean space to represent a wall, and the quad defines a 3D plane whose equation is:

$$ax + by + cz + d = 0$$

Vector $(a, b, c)$ is exactly the normal vector of the plane. To remove ambiguity, we manually make all normal vectors point towards the center of the room. This plane divides the 3D space into two parts. For a point with coordinate $(x_0, y_0, z_0)$, if $ax_0 + by_0 + cz_0 + d > 0$, the point is on the same side as the room center; otherwise, the point is outside the room. Fig. 4 illustrates the situation where a bounding box (green) intersects a wall (blue, right): vertex B1 is inside the room while vertex B2 is outside the room. For a 3D object box, we traverse its eight vertices and determine whether they cross the walls using the plane equations.

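For a vertex $v = (x_v, y_v, z_v)$, the physical constraint loss we minimize takes the form of a ReLU, $\mathrm{ReLU}(-(ax_v + by_v + cz_v + d))$, which is zero for vertices inside the room. However, imposing this loss on all objects and walls might cause wrong constraints. For example, in the left part of Fig. 3, a wall and a sofa should not constrain each other when the bounding box of the sofa and the quad of the wall do not actually intersect. But if we impose the loss above between them, it leads to a non-zero physical constraint loss. To avoid this kind of wrong constraint, we first determine whether the projection of a bounding box vertex lies within the wall quad before calculating the physical constraint loss. We project the vertex onto the wall plane and compare its projection with the quad size. The loss for a set of detection results with $N_o$ objects and $N_q$ quads is:

$$L_{pc} = \sum_{i=1}^{N_o} \sum_{j=1}^{N_q} \mathbb{1}\left[\mathrm{cls}(o_i) \in \mathcal{C}_{pc}\right] \sum_{k=1}^{8} \mathbb{1}_{q_j}\left(\pi_{q_j}(v_{ik})\right) \mathrm{ReLU}\left(-(a_j x_{ik} + b_j y_{ik} + c_j z_{ik} + d_j)\right)$$

Here $q_j$ denotes a quad and $v_{ik} = (x_{ik}, y_{ik}, z_{ik})$ is the $k$-th vertex of the bounding box of object $o_i$. $\pi_{q_j}(\cdot)$ is the operator projecting a point onto the plane that $q_j$ defines. $\mathrm{cls}(o_i)$ is the semantic class of object $o_i$ and $\mathcal{C}_{pc}$ is the set of object classes for which the physical constraint loss is calculated. The plane equation of the $j$-th quad is $a_j x + b_j y + c_j z + d_j = 0$, and $\mathbb{1}_{q_j}(\cdot)$ indicates whether the projection of a vertex is in $q_j$: if it is, it returns 1; otherwise, it returns 0.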
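The following NumPy sketch shows one way such a loss could be evaluated for a set of predicted boxes and wall quads. It assumes vertical walls, a z-up coordinate frame, normals pointing into the room, and a plain sum over all terms; the function name, the dictionary keys and the lack of normalization are our assumptions, not details confirmed by the text.

```python
import numpy as np


def physical_constraint_loss(box_vertices, box_classes, quads, constrained_classes):
    """Sum of ReLU penalties for box vertices lying behind a wall quad.

    box_vertices:        (N_o, 8, 3) corner coordinates of each object box
    box_classes:         length-N_o sequence of class labels
    quads:               list of dicts with keys "center" (3,), "normal" (3,),
                         "size" (width, height); keys are hypothetical
    constrained_classes: set of labels that must not overlap walls
    """
    up = np.array([0.0, 0.0, 1.0])
    loss = 0.0
    for verts, cls in zip(box_vertices, box_classes):
        if cls not in constrained_classes:
            continue  # e.g. doors, windows and curtains are skipped
        for q in quads:
            n = q["normal"] / np.linalg.norm(q["normal"])
            side = np.cross(up, n)  # horizontal in-plane axis of the wall
            side /= np.linalg.norm(side)
            rel = verts - q["center"]  # (8, 3)
            signed = rel @ n           # a*x + b*y + c*z + d with unit normal
            # In-plane coordinates of each vertex projected onto the wall plane.
            u, v = rel @ side, rel @ up
            w, h = q["size"]
            in_quad = (np.abs(u) <= 0.5 * w) & (np.abs(v) <= 0.5 * h)
            loss += float((np.maximum(-signed, 0.0) * in_quad).sum())
    return loss
```

The in-quad mask plays the role of the projection check: a vertex that lies behind the infinite plane but outside the extent of the wall quad contributes nothing.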

III-D Loss

First, we denote the number of transformer decoder layers as $L$. We get $L+1$ sets of detection results in total. Specifically, $L$ sets are generated from the $L$ layers of the decoder and one set is generated from the proposal module. We then calculate the loss on each set of results and use the summation as the final loss. The losses on the intermediate decoder outputs and on the proposal module output play the role of auxiliary supervision, which helps PQ-Transformer converge. Let the loss of the $i$-th set of detection results be $\mathcal{L}_i$. The total loss used in training combines the sum of all $\mathcal{L}_i$ with $\mathcal{L}_{vote}$, the loss for voting vectors.

$\mathcal{L}_{vote}$ compares each predicted voting vector with the ground truth voting vector. An indicator selects the points that contribute: its value is 1 if the point is inside a bounding box, and 0 otherwise.

For each set of results, $\mathcal{L}_i$ includes the loss between the predicted bounding boxes and the ground truth boxes, and the loss between the predicted quads and the ground truth quads. Each of the two is a weighted sum of classification and regression terms, where the weights are loss weight parameters. We use cross entropy loss for all classification results and smooth L1 loss for all regression results. Detailed loss weight settings can be found in the supplementary material.
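The pieces of this training objective can be sketched as follows: a smooth L1 term for regression outputs, a masked vote loss, and a total that adds the vote loss to the per-set losses. The exact normalization, the `beta` threshold and the unweighted sum in `total_loss` are assumptions for illustration; the actual weights are given only in the supplementary material.

```python
import numpy as np


def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 loss: quadratic below beta, linear above it."""
    diff = np.abs(pred - target)
    per_elem = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return float(per_elem.mean())


def vote_loss(pred_offsets, gt_offsets, inside_box):
    """L1 distance between predicted and ground truth voting vectors,
    averaged over the points that lie inside a bounding box."""
    mask = inside_box.astype(float)
    per_point = np.abs(pred_offsets - gt_offsets).sum(axis=1)
    return float((per_point * mask).sum() / max(mask.sum(), 1.0))


def total_loss(vote_term, per_set_losses):
    """Vote loss plus the summed losses of all L + 1 sets of results."""
    return vote_term + sum(per_set_losses)
```

With six decoder layers, `per_set_losses` would hold seven values: one from the proposal module and one from each decoder layer.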

IV Experiment

IV-A Comparisons with State-of-the-art Methods

Evaluation details. We validate PQ-Transformer on the widely used indoor scene dataset ScanNet [34]. It contains 1.2K real-world RGB-D scans collected from hundreds of different rooms, and is annotated with semantic and instance segmentation labels for 18 object categories. In addition, SceneCAD [2] introduces a new dataset by adding 3D layout annotations to ScanNet, allowing large-scale data-driven training for layout estimation. The SceneCAD layout dataset contains 13.8K corners, 20.5K edges and 8.4K polygons. We first preprocess these annotations, choosing polygons that have 4 vertices and nearly horizontal normal vectors as the ground truth wall quads during training. We use the official ScanNet data split. In later paragraphs and tables, "single" means that we train the object detector and the layout estimator separately, and "joint" represents our full PQ-Transformer architecture illustrated in Fig. 2.
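The preprocessing rule for wall quads (four vertices, nearly horizontal normal) can be sketched as a simple filter. The function name and the threshold `max_abs_nz` on the vertical component of the unit normal are hypothetical; the text does not state the tolerance that was actually used.

```python
import numpy as np


def is_wall_quad(vertices, max_abs_nz=0.1):
    """Return True if a layout polygon qualifies as a wall quad.

    vertices:   (K, 3) polygon corners in order, z-axis pointing up.
    max_abs_nz: hypothetical tolerance on the vertical normal component.
    """
    vertices = np.asarray(vertices, dtype=float)
    if vertices.shape != (4, 3):
        return False  # keep only polygons with exactly 4 vertices
    normal = np.cross(vertices[1] - vertices[0], vertices[2] - vertices[0])
    norm = np.linalg.norm(normal)
    if norm == 0.0:
        return False  # degenerate polygon
    # A wall has a (nearly) horizontal normal, i.e. a small z component.
    return bool(abs(normal[2] / norm) < max_abs_nz)
```

Under this rule a vertical rectangle passes the filter, while a floor or ceiling polygon, whose normal is vertical, is rejected.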

Fig. 6: Qualitative comparisons on ScanNet. After adding the physical constraint loss, the bounding boxes no longer overlap with the walls (left), and the wrongly predicted bounding box disappears (right).

Layout estimation. We show our layout estimation results on ScanNet in Tab. I. SceneCAD [2] uses a bottom-up pipeline to predict quads hierarchically. In contrast, our approach generates quad proposals directly and refines them with a transformer. For comparison, we use the same evaluation metrics as SceneCAD. As mentioned before, ceiling and floor polygons (not necessarily quads) are generated by connecting the upper and lower boundaries of the predicted wall quads (see details in the supplementary material). A polygon corner is considered successfully detected if the predicted corner is within a radius of 40 cm from any ground truth corner. Similarly, a predicted polygon is considered correct if it is composed of the same corner set as any ground truth polygon. As shown in Tab. I, the room layout F1-score on ScanNet is significantly promoted from 37.9% to 57.9%. If only wall quads are considered, the F1-score is 70.9%. For joint detection, the F1-score (55.8%) also outperforms the previous state of the art by 17.9 percentage points.

| Method | F1-score, all (%) | F1-score, wall only (%) |
|---|---|---|
| SceneCAD | 37.9 | - |
| Ours (joint) | 55.8 | 68.7 |
| Ours (single) | 57.9 | 70.9 |

TABLE I: Layout estimation results on ScanNet.
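For intuition on the corner-level metric, the sketch below computes an F1-score from a distance-threshold matching between predicted and ground truth corners. The precise matching protocol of the benchmark is not reproduced here: the function name, the way precision and recall are counted, and the default radius of 0.4 (which assumes coordinates in meters) are our assumptions.

```python
import numpy as np


def corner_f1(pred, gt, radius=0.4):
    """F1-score of predicted corners against ground truth corners.

    A predicted corner counts as correct if some ground truth corner lies
    within `radius`; a ground truth corner counts as recalled if some
    predicted corner lies within `radius`.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if len(pred) == 0 or len(gt) == 0:
        return 0.0
    # Pairwise distances, shape (num_pred, num_gt).
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=2)
    precision = float((dists.min(axis=1) <= radius).mean())
    recall = float((dists.min(axis=0) <= radius).mean())
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

For example, with three predicted corners of which two fall near the two ground truth corners, precision is 2/3, recall is 1, and the F1-score is 0.8.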

3D object detection. We compare our 3D object detection results with previous state-of-the-art methods in Tab. II. L6 means 6 attention layers and O256 means 256 proposals. HGNet [35] exploits a graph convolution network based upon hierarchical modelling for 3D detection. VoteNet [3] uses point-wise voting vectors to generate object proposals. Group-Free [33] is an attention-based detector that generates object proposals with k-nearest point sampling. GSDN [37] uses a fully convolutional sparse-dense hybrid network to generate the support for object proposals. H3DNet [36] predicts a diverse set of geometric primitives and converts them into object proposals. Following the standard evaluation protocol, we use mean Average Precision (mAP) to evaluate PQ-Transformer on object detection. Tab. II shows that our approach performs comparably with the state-of-the-art methods.

| Method | mAP@0.25 (%) |
|---|---|
| VoteNet [3] | 58.7 |
| HGNet [35] | 61.3 |
| GSDN [37] | 62.8 |
| H3DNet [36] | 67.2 |
| Group-Free [33] (L6, O256) | 67.3 |
| Ours (joint, L6, O256) | 66.9 |
| Ours (single, L6, O256) | 67.2 |
TABLE II: 3D object detection results on ScanNet.

IV-B Ablation Study

Physical constraint loss. To investigate the necessity of physical constraint loss, we train two models with and without it. We demonstrate the results in Tab.III. The mAP of object detection rises from 64.4% to 66.9% after adding physical constraint loss and the F1-score of layout estimation increases from 54.7% to 55.8%, which clearly shows the effectiveness of our physical constraint loss. We also show the number of collisions between objects and walls with two models in Tab.III. One collision means a vertex of the object bounding box is out of the room. The sharp drop in the number of collisions shows that our physical constraint loss discourages object-layout interference successfully.

As demonstrated in Fig. 6, the object detection results are more reasonable with the physical constraint loss. In the top-left sample, the bounding box of the toilet in the red box intersects the wall, which is impossible in the real world. When training with the physical constraint loss, this error no longer exists. In the top-right sample, influenced by the point cloud outside the room, a meaningless bounding box is predicted there when training without the physical constraint loss. It vanishes after adding the loss.

| Physical constraint loss | Object mAP (%) | Layout F1-score (%) | No. of collisions |
|---|---|---|---|
| w/o | 64.4 | 54.7 | 7208 |
| w/ | 66.9 | 55.8 | 9 |
TABLE III: Joint prediction results on ScanNet with or without using the physical constraint loss.

Architecture. Since it remains unclear how to design a single transformer for two structured prediction tasks, we design experiments to compare several alternative architectures, which are shown in Tab. IV. "Joint (one proposal)" represents the model trained with a single proposal module for both object detection and layout estimation. In this case, the two tasks compete for bottom-up proposals. Our architecture depicted in Fig. 2 is denoted as "joint (two proposals)". Tab. IV shows that although "single" achieves the best accuracy, it is the slowest (4.29 FPS). "Joint (one proposal)" has the best efficiency (9.52 FPS), but its object mAP drops to 44.6%. Our model achieves quantitative results comparable with "single" (66.9% vs. 67.2% mAP, 55.8% vs. 57.9% F1-score), while its speed (8.91 FPS) is close to that of "joint (one proposal)". This verifies the effectiveness of our architecture. We believe this insight is useful for similar multi-task transformer architectures: separate different tasks at the proposal stage, rather than at the inputs or the prediction heads.

| Architecture | Speed (FPS) | Object mAP (%) | Layout F1-score (%) |
|---|---|---|---|
| single | 4.29 | 67.2 | 57.9 |
| joint (one proposal) | 9.52 | 44.6 | 52.4 |
| joint (two proposals) | 8.91 | 66.9 | 55.8 |
TABLE IV: Architecture design comparisons.

IV-C Qualitative Results and Discussion

Fig. 7: Failure cases on ScanNet. PQ-Transformer fails to detect the two partition walls in the middle of the room (left) and the inclined wall (right).

Fig. 5 shows our joint parsing results on ScanNet. It is evident from Fig. 5 that our approach can predict the wall quads precisely even if the room is non-rectangular, and can detect the bounding boxes of most objects successfully. The differences between our object detection results and the ground truth mainly arise from annotation ambiguity and duplicate detection. More specifically, in the first column of Fig. 5, our approach detects the desk in the bottom-left corner although it is not annotated in the ground truth. In the second column, our approach recognizes the corner sofa as two separate sofas, while the ground truth treats it as a single one. More qualitative results are provided in the supplementary material. Considering the diversity of these scenes, we believe PQ-Transformer is accurate enough for various robotics applications.

Our layout estimation approach still has limitations. Fig. 7 shows some failure cases on ScanNet. In the first column, our approach fails to detect the two partition walls in the middle of the room. In the second column, it is unable to detect the inclined wall on the right side of the room.

V Conclusion

In this study, we develop the first attention-based neural network to predict 3D objects and layout quads simultaneously, taking only point clouds as inputs. We introduce a novel representation for layout: a set of 3D quads. Along with it, we propose a tailored physical constraint loss function that discourages object-layout interference. A multi-task transformer architecture that strikes the balance between accuracy and efficiency is proposed. We evaluate PQ-Transformer on the public benchmark ScanNet and show that: 1) The new physical constraint loss can improve strong baselines. 2) The layout F1-score on ScanNet is significantly boosted from 37.9% to 57.9%. We believe our method is useful for robotics applications as the final model runs at a quasi-real-time (8.91 FPS) rate without efficiency-oriented optimization.


  • [1] Choy, C., Gwak, J. and Savarese, S., 2019. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3075-3084).
  • [2] Avetisyan, A., Khanova, T., Choy, C., Dash, D., Dai, A. and Nießner, M., 2020. SceneCAD: Predicting object alignments and layouts in rgb-d scans. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII 16 (pp. 596-612). Springer International Publishing.
  • [3] Qi, C.R., Litany, O., He, K. and Guibas, L.J., 2019. Deep hough voting for 3d object detection in point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 9277-9286).
  • [4] Hedau, V., Hoiem, D. and Forsyth, D., 2009, September. Recovering the spatial layout of cluttered rooms. In 2009 IEEE 12th international conference on computer vision (pp. 1849-1856). IEEE.
  • [5] Zhu, Y., Gao, T., Fan, L., Huang, S., Edmonds, M., Liu, H., Gao, F., Zhang, C., Qi, S., Wu, Y.N. and Tenenbaum, J.B., 2020. Dark, beyond deep: A paradigm shift to cognitive ai with humanlike common sense. Engineering, 6(3), pp.310-345.
  • [6] Zheng, B., Zhao, Y., Yu, J., Ikeuchi, K. and Zhu, S.C., 2015. Scene understanding by reasoning stability and safety. International Journal of Computer Vision, 112(2), pp.221-238.
  • [7] Fouhey, D.F., Delaitre, V., Gupta, A., Efros, A.A., Laptev, I. and Sivic, J., 2014. People watching: Human actions as a cue for single view geometry. International journal of computer vision, 110(3), pp.259-274.
  • [8] Newcombe, R.A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A.J., Kohi, P., Shotton, J., Hodges, S. and Fitzgibbon, A., 2011, October. Kinectfusion: Real-time dense surface mapping and tracking. In 2011 10th IEEE international symposium on mixed and augmented reality (pp. 127-136). IEEE.
  • [9] Han, L. and Fang, L., 2018, June. FlashFusion: Real-time Globally Consistent Dense 3D Reconstruction using CPU Computing. In Robotics: Science and Systems (Vol. 1, No. 6, p. 7).
  • [10] Hoiem, D., Efros, A.A. and Hebert, M., 2007. Recovering surface layout from an image. International Journal of Computer Vision, 75(1), pp.151-172.
  • [11] Hedau, V., Hoiem, D. and Forsyth, D., 2009, September. Recovering the spatial layout of cluttered rooms. In 2009 IEEE 12th international conference on computer vision (pp. 1849-1856). IEEE.
  • [12] Lee, D.C., Hebert, M. and Kanade, T., 2009, June. Geometric reasoning for single image structure recovery. In 2009 IEEE conference on computer vision and pattern recognition (pp. 2136-2143). IEEE.
  • [13] Schwing, A.G., Fidler, S., Pollefeys, M. and Urtasun, R., 2013. Box in the box: Joint 3d layout and object reasoning from single images. In Proceedings of the IEEE International Conference on Computer Vision (pp. 353-360).
  • [14] Choi, W., Chao, Y.W., Pantofaru, C. and Savarese, S., 2015. Indoor scene understanding with geometric and semantic contexts. International Journal of Computer Vision, 112(2), pp.204-220.
  • [15] Zhao, Y. and Zhu, S.C., 2013. Scene parsing by integrating function, geometry and appearance models. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3119-3126).
  • [16] Qin, Z., Wang, J. and Lu, Y., 2019, July. Monogrnet: A geometric reasoning network for monocular 3d object localization. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, No. 01, pp. 8851-8858).
  • [17] Huang, S., Chen, Y., Yuan, T., Qi, S., Zhu, Y. and Zhu, S.C., 2019. Perspectivenet: 3d object detection from a single rgb image via perspective points. arXiv preprint arXiv:1912.07744.
  • [18] Wu, J., Wang, Y., Xue, T., Sun, X., Freeman, W.T. and Tenenbaum, J.B., 2017. Marrnet: 3d shape reconstruction via 2.5 d sketches. arXiv preprint arXiv:1711.03129.
  • [19] Chen, Z., Tagliasacchi, A. and Zhang, H., 2020. Bsp-net: Generating compact meshes via binary space partitioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 45-54).
  • [20] Mallya, A. and Lazebnik, S., 2015. Learning informative edge maps for indoor scene layout prediction. In Proceedings of the IEEE international conference on computer vision (pp. 936-944).
  • [21] Dasgupta, S., Fang, K., Chen, K. and Savarese, S., 2016. Delay: Robust spatial layout estimation for cluttered indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 616-624).
  • [22] Zhao, H., Lu, M., Yao, A., Guo, Y., Chen, Y. and Zhang, L., 2017. Physics inspired optimization on semantic transfer features: An alternative method for room layout estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 10-18).
  • [23] Fernandez-Labrador, C., Facil, J.M., Perez-Yus, A., Demonceaux, C., Civera, J. and Guerrero, J.J., 2020. Corners for layout: End-to-end layout recovery from 360 images. IEEE Robotics and Automation Letters, 5(2), pp.1255-1262.
  • [24] Huang, S., Qi, S., Zhu, Y., Xiao, Y., Xu, Y. and Zhu, S.C., 2018. Holistic 3d scene parsing and reconstruction from a single rgb image. In Proceedings of the European conference on computer vision (ECCV) (pp. 187-203).
  • [25] Chen, Y., Huang, S., Yuan, T., Qi, S., Zhu, Y. and Zhu, S.C., 2019. Holistic++ scene understanding: Single-view 3d holistic scene parsing and human pose estimation with human-object interaction and physical commonsense. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 8648-8657).
  • [26] Nie, Y., Han, X., Guo, S., Zheng, Y., Chang, J. and Zhang, J.J., 2020. Total3dunderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 55-64).
  • [27] Song, S. and Xiao, J., 2014, September. Sliding shapes for 3d object detection in depth images. In European conference on computer vision (pp. 634-651). Springer, Cham.
  • [28] Song, S. and Xiao, J., 2016. Deep sliding shapes for amodal 3d object detection in rgb-d images. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 808-816).
  • [29] Song, S., Yu, F., Zeng, A., Chang, A.X., Savva, M. and Funkhouser, T., 2017. Semantic scene completion from a single depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1746-1754).
  • [30] Zhang, J., Zhao, H., Yao, A., Chen, Y., Zhang, L. and Liao, H., 2018. Efficient semantic scene completion network with spatial group convolution. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 733-749).
  • [31] Graham, B., Engelcke, M. and Van Der Maaten, L., 2018. 3d semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 9224-9232).
  • [32] Qi, C.R., Yi, L., Su, H. and Guibas, L.J., 2017. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413.
  • [33] Liu, Z., Zhang, Z., Cao, Y., Hu, H. and Tong, X., 2021. Group-Free 3D Object Detection via Transformers. arXiv preprint arXiv:2104.00678.
  • [34] Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T. and Nießner, M., 2017. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5828-5839).
  • [35] Chen, J., Lei, B., Song, Q., Ying, H., Chen, D.Z. and Wu, J., 2020. A hierarchical graph network for 3D object detection on point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 392-401).
  • [36] Zhang, Z., Sun, B., Yang, H. and Huang, Q., 2020, August. H3dnet: 3d object detection using hybrid geometric primitives. In European Conference on Computer Vision (pp. 311-329). Springer, Cham.
  • [37] Gwak, J., Choy, C. and Savarese, S., 2020. Generative sparse detection networks for 3d single-shot object detection. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16 (pp. 297-313). Springer International Publishing.


V-A Method Details

This section provides additional implementation details of PQ-Transformer. We first describe the network architecture (Section V-A1) and the layout estimation details (Section V-A2). We then discuss our implementation of the physical constraint loss (Section V-A3) and our loss weight settings (Section V-A4). Finally, we provide training details (Section V-A5).

V-A1 Architecture Specification Details

The point cloud feature learning backbone is implemented with modules from PointNet++ [1]. It consists of four set abstraction layers and two feature propagation layers. The set abstraction layers successively down-sample the input point cloud to 2048, 1024, 512 and 256 points. The two feature propagation layers then up-sample the point cloud features to 512 and 1024 points by applying trilinear interpolation to the input features.

We generate proposals for object detection and for layout estimation where . We then refine the proposals with a transformer decoder that has 6 attention layers and 8 attention heads.

Following [2], we parameterize an oriented 3D bounding box as a vector of size 2 + 3 + 2H + 4S + C. The first two entries are objectness scores and the next three are center regression results. H is the number of heading bins; we predict a classification score and a regression offset for each heading bin. S is the number of size bins; similarly, we predict a classification score and three regression results (height, width and length) for each size bin. C is the number of semantic classes. For ScanNet [5], we set H = 12 and S = C = 18.
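The component counts listed above can be summed to check the length of the prediction vector. The following is a minimal sketch; the function and argument names are illustrative and are not taken from the released code.

```python
def box_vector_size(num_heading_bins: int, num_size_bins: int, num_classes: int) -> int:
    """Length of the per-proposal box prediction vector described in the text."""
    objectness = 2                   # two objectness scores
    center = 3                       # center regression (x, y, z)
    heading = 2 * num_heading_bins   # one class score + one offset per heading bin
    size = 4 * num_size_bins         # one class score + three residuals per size bin
    semantic = num_classes           # one score per semantic class
    return objectness + center + heading + size + semantic


# ScanNet setting from the text: H = 12, S = C = 18.
scannet_size = box_vector_size(num_heading_bins=12, num_size_bins=18, num_classes=18)
print(scannet_size)  # 119
```

With the ScanNet setting, each proposal is therefore described by 2 + 3 + 24 + 72 + 18 = 119 values.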

V-A2 Layout Estimation Details: Quad NMS and Ceiling/Floor Evaluation

The layout estimation result obtained by the prediction head contains quads. To remove duplicate quads, we give each quad a fixed width to form a flat cuboid. In our implementation, we set the width to 10 so that these flat cuboids can be processed easily with 3D NMS. During training, we set the IoU threshold of NMS to 0.25.
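The idea of thickening quads and suppressing duplicates can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes walls aligned with the x axis so that the flat cuboids are axis-aligned, whereas real wall quads can have arbitrary orientation. The helper names and the thickness used in the usage note are hypothetical.

```python
from typing import List, Sequence, Tuple

# Axis-aligned cuboid: (xmin, ymin, zmin, xmax, ymax, zmax).
Box = Tuple[float, float, float, float, float, float]


def thicken_wall_x(x: float, y_range: Tuple[float, float],
                   z_range: Tuple[float, float], thickness: float) -> Box:
    """Turn a wall quad lying in the plane x = const into a flat cuboid."""
    half = thickness / 2.0
    return (x - half, y_range[0], z_range[0], x + half, y_range[1], z_range[1])


def box_volume(b: Box) -> float:
    return (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned cuboids."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    return inter / (box_volume(a) + box_volume(b) - inter)


def nms_3d(boxes: Sequence[Box], scores: Sequence[float],
           iou_threshold: float = 0.25) -> List[int]:
    """Greedy NMS: visit boxes by descending score, drop heavy overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou_3d(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For example, two nearly coincident walls at x = 0.0 and x = 0.02 thickened by an illustrative 0.1 overlap with an IoU of about 0.67, so only the higher-scoring one survives the 0.25 threshold.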

After 3D NMS and quadness filtering (we only consider quads with quadness scores > 0.5), we obtain the final quads and use them to generate the ceiling and the floor. First, we initialize an edge list for the ceiling. We then traverse the quads, adding the upper edge of each quad to the list. After that, we iterate through the list: if the distance between two vertices on two different edges is less than 40, the two vertices are merged by averaging. After merging, the list becomes the edge estimate of the ceiling, which defines a polygon. The merging procedure is illustrated in Fig. 8. We calculate the floor edges in the same way, using the lower edges of the quads. As mentioned in the main paper, we use the same evaluation metrics as SceneCAD [6]. Two vertices are considered the same if they are within 40 of each other, and an edge is considered correct if it is composed of the same two vertices as a ground truth edge. Hence, if the ceiling or floor is composed of the same edges as a ground truth polygon, it is considered successfully estimated.

Fig. 8: Illustration of the merging procedure in Ceiling/Floor evaluation.
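The merging step can be sketched as a greedy clustering of edge endpoints. This is one possible realization under stated assumptions, not the authors' code: each endpoint joins the first existing cluster whose running mean lies within the threshold, and the merged vertex is the mean of its cluster. The function name and the threshold value in the usage note are illustrative.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]
Edge = Tuple[Point, Point]


def merge_edge_vertices(edges: List[Edge], threshold: float):
    """Merge nearby endpoints of quad edges by averaging.

    Returns the merged vertices and the edges as index pairs into them.
    """
    clusters: List[List[Point]] = []  # raw endpoints grouped per merged vertex

    def centroid(points: List[Point]) -> Point:
        n = len(points)
        return tuple(sum(p[k] for p in points) / n for k in range(3))

    def assign(p: Point) -> int:
        # Join the first cluster whose current mean is closer than the threshold.
        for idx, members in enumerate(clusters):
            if math.dist(p, centroid(members)) < threshold:
                members.append(p)
                return idx
        clusters.append([p])
        return len(clusters) - 1

    labels = [(assign(a), assign(b)) for a, b in edges]
    vertices = [centroid(members) for members in clusters]
    # Drop edges that collapsed onto a single merged vertex.
    merged_edges = [(i, j) for i, j in labels if i != j]
    return vertices, merged_edges
```

Applied to the four upper edges of a roughly rectangular room whose corners disagree by about 0.1, with an illustrative threshold of 0.4, this yields four merged vertices and a closed polygon of four edges.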

V-A3 Physical Constraint Implementation Details

The generic physical constraint loss introduced in the main paper has high computational complexity, because we have to traverse the eight vertices of every object bounding box for every quad. Since the wall quads are nearly vertical, in practice we compute only a 2D version of the physical constraint loss to reduce computation. We transform all the bounding boxes and quads into the top-down view, where the bounding boxes become rectangles and the quads become line segments. We represent a line segment with equation and a length . And for a vertex , the physical constraint loss between it and the line segment becomes ReLU. To further improve efficiency, we accelerate the iteration over vertices with matrix operations. We use to describe vertices whose coordinates are of dimension . Q = represents the normal vector of the line segment. indicates whether the projections of the vertices lie on the line segment. The loss between n vertices and the line segment (quad) is:


where denotes element-wise product and sum means summation of all elements in the matrix. We compare the training time of different implementations of the physical constraint loss in Tab. V, which shows the efficiency of our implementation.

| Physical constraint loss | Time cost (s) |
| --- | --- |
| w/o | 0.24 |
| w/ (trivial) | 1.77 |
| w/ (efficient) | 0.32 |

TABLE V: Time cost for one batch (8 samples) training.
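A vectorized top-down version of such a loss can be sketched with NumPy. The sign convention here is an assumption, since the exact formula is not reproduced above: the segment normal is oriented toward a given interior point, and a vertex is penalized by its distance to the line when it lies on the outer side and its projection falls within the segment. All names are hypothetical.

```python
import numpy as np


def physical_constraint_loss_2d(vertices, p0, p1, interior_point) -> float:
    """Penalty for rectangle vertices that cross a wall segment (top-down view).

    vertices: (n, 2) array of box corner coordinates.
    p0, p1: end points of the wall line segment.
    interior_point: any point known to be inside the room (assumption, used
        only to orient the segment normal).
    """
    vertices = np.asarray(vertices, dtype=float)
    p0 = np.asarray(p0, dtype=float)
    p1 = np.asarray(p1, dtype=float)

    direction = p1 - p0
    length = np.linalg.norm(direction)
    tangent = direction / length
    normal = np.array([-tangent[1], tangent[0]])
    if np.dot(np.asarray(interior_point, dtype=float) - p0, normal) < 0:
        normal = -normal  # make the normal point into the room

    rel = vertices - p0                # (n, 2) offsets from the segment start
    signed_dist = rel @ normal         # (n,) positive inside the room
    along = rel @ tangent              # (n,) position along the segment
    mask = (along >= 0.0) & (along <= length)  # projection lies on the segment

    # ReLU of the outward distance, masked element-wise, summed over vertices.
    return float(np.sum(np.maximum(-signed_dist, 0.0) * mask))
```

All vertices are handled by two matrix-vector products instead of a Python loop, which is the kind of saving the timing comparison in Tab. V refers to.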

V-A4 Loss Balancing Details

PQ-Transformer is trained with a multi-task loss in an end-to-end fashion. As mentioned in the main paper, the object loss is denoted as:


We use the loss weights to balance different loss functions as follows:

The quad loss is denoted as:


The weights for the quad loss are:

Specifically, the detailed form of the box loss is:


where the individual terms are the loss for the bounding box center, the heading bin classification loss, the heading bin regression loss, and the classification score loss and regression loss for the box size bins. The weights are as follows:

V-A5 Training Details

We train PQ-Transformer with three NVIDIA GeForce RTX 3090 GPUs and test it on a single GPU. The network is trained end-to-end with an AdamW optimizer. We sample 40K vertices from each ScanNet scene as the input point cloud and set the batch size per GPU to 8. The model is trained for 600 epochs.

V-B More Qualitative Results

We show more qualitative results of PQ-Transformer on ScanNet in Figs. 9 and 10. Given the diversity of scenes and objects in these cases, we believe our approach achieves accurate and robust object detection and layout estimation.

V-C ScanNet Per-category Evaluation

Tab. VI reports per-category average precision on ScanNet with a 0.25 IoU threshold. It shows that PQ-Transformer performs comparably to state-of-the-art methods and performs better in some categories.

Fig. 9: More qualitative results.
Fig. 10: More qualitative results (cont.).
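
| Method | bathtub | bed | bookshelf | cabinet | chair | counter | curtain | desk | door | mAP@0.25 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VoteNet [2] | 92.1 | 87.9 | 44.3 | 36.3 | 88.7 | 56.1 | 47.2 | 71.7 | 47.3 | 58.7 |
| H3DNet [3] | 92.5 | 88.6 | 54.9 | 49.4 | 91.8 | 62.0 | 57.3 | 75.9 | 55.8 | 67.2 |
| Group-Free (L6,O256) [4] | 92.5 | 86.2 | 48.5 | 54.1 | 92.0 | 59.4 | 64.2 | 80.4 | 55.8 | 67.3 |
| Ours (joint, one proposal) | 50.5 | 79.3 | 28.3 | 35.7 | 75.8 | 17.5 | 41.2 | 60.0 | 27.8 | 44.6 |
| Ours (joint, w/o physical constraint loss) | 90.9 | 89.6 | 43.0 | 42.6 | 87.4 | 61.4 | 69.3 | 77.5 | 51.7 | 64.4 |
| Ours (joint) | 90.0 | 94.4 | 65.3 | 55.2 | 89.5 | 51.3 | 58.7 | 87.5 | 58.4 | 66.9 |
| Ours (single) | 88.5 | 94.4 | 54.2 | 50.0 | 88.2 | 55.3 | 64.6 | 84.3 | 60.5 | 67.2 |

| Method | garbage bin | picture | fridge | sink | shower curtain | sofa | table | toilet | window |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VoteNet [2] | 37.2 | 7.8 | 45.4 | 54.7 | 57.1 | 89.6 | 58.8 | 94.9 | 38.1 |
| H3DNet [3] | 53.6 | 18.6 | 57.2 | 67.4 | 75.3 | 90.2 | 64.9 | 97.9 | 51.9 |
| Group-Free (L6,O256) [4] | 55.0 | 15.0 | 57.2 | 76.8 | 76.3 | 84.8 | 67.8 | 97.6 | 46.9 |
| Ours (joint, one proposal) | 26.1 | 3.5 | 28.5 | 65.3 | 48.8 | 76.8 | 20.6 | 89.0 | 28.6 |
| Ours (joint, w/o physical constraint loss) | 45.8 | 15.5 | 56.0 | 64.5 | 79.0 | 96.6 | 47.0 | 96.3 | 44.9 |
| Ours (joint) | 53.5 | 14.9 | 60.0 | 62.1 | 65.4 | 96.9 | 54.7 | 97.6 | 48.1 |
| Ours (single) | 54.1 | 21.8 | 54.2 | 65.8 | 81.1 | 90.0 | 51.6 | 98.4 | 51.9 |

TABLE VI: Per-category 3D object detection results on ScanNet (average precision at a 0.25 IoU threshold).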