PerspectiveNet: 3D Object Detection from a Single RGB Image via Perspective Points

12/16/2019 ∙ by Siyuan Huang, et al.

Detecting 3D objects from a single RGB image is intrinsically ambiguous, thus requiring appropriate prior knowledge and intermediate representations as constraints to reduce the uncertainties and improve the consistency between the 2D image plane and the 3D world coordinates. To address this challenge, we propose to adopt perspective points as a new intermediate representation for 3D object detection, defined as the 2D projections of the local Manhattan 3D keypoints that locate an object; these perspective points satisfy geometric constraints imposed by the perspective projection. We further devise PerspectiveNet, an end-to-end trainable model that simultaneously detects the 2D bounding box, 2D perspective points, and 3D object bounding box for each object from a single RGB image. PerspectiveNet yields three unique advantages: (i) 3D object bounding boxes are estimated based on perspective points, bridging the gap between 2D and 3D bounding boxes without the need for category-specific 3D shape priors. (ii) It predicts the perspective points by a template-based method, and a perspective loss is formulated to maintain the perspective constraints. (iii) It maintains the consistency between the 2D perspective points and 3D bounding boxes via a differentiable projective function. Experiments on the SUN RGB-D dataset show that the proposed method significantly outperforms existing RGB-based approaches for 3D object detection.


1 Introduction

If one hopes to achieve a full understanding of a system as complicated as a nervous system, …, or even a large computer program, then one must be prepared to contemplate different kinds of explanation at different levels of description that are linked, at least in principle, into a cohesive whole, even if linking the levels in complete details is impractical. — David Marr Marr (1982), pp. 20–21

In a classic view of computer vision, David Marr (1982) conjectured that the perception of a 2D image is an explicit multi-phase information process, involving (i) an early vision system that perceives textures Julesz (1962); Zhu et al. (1998) and textons Julesz (1981); Zhu et al. (2005) to form a primal sketch as a perceptually lossless conversion of the raw image Guo et al. (2003, 2007), (ii) a mid-level vision system that constructs 2.1D (multiple layers with partial occlusion) Nitzberg and Mumford (1990); Wang and Adelson (1993, 1994) and 2.5D Marr and Nishihara (1978) sketches, and (iii) a high-level vision system that recovers the full 3D Binford (1971); Brooks (1981); Kanade (1981). In particular, he highlighted the importance of different levels of organization and the internal representation Broadbent (1985).

In parallel, the school of Gestalt Laws (Wertheimer, 1912; Wagemans et al., 2012a, b; Köhler, 1920, 1938; Wertheimer, 1923, 1938; Koffka, 2013) and perceptual organization Lowe (2012); Pentland (1987) aims to resolve the 3D reconstruction problem from a single RGB image without relying on depth cues; rather, it uses priors, such as groupings and structural cues Waltz (1975); Barrow and Tenenbaum (1981), that are likely to be invariant over wide ranges of viewpoints Lowe (1987), resulting in the birth of the SIFT feature Lowe (2004). Later, from a Bayesian perspective at the scene level, such priors, independent of any specific 3D scene structure, were found in human-made scenes, known as the Manhattan World assumption Coughlan and Yuille (2003). Importantly, further studies found that such priors help to improve object detection Coughlan and Yuille (1999).

In this paper, inspired by these two classic schools in computer vision, we seek to test the following two hypotheses using modern computer vision methods: (i) Could an intermediate representation facilitate modern computer vision tasks? (ii) Is such an intermediate representation a better and more invariant prior compared to the priors obtained directly from specific tasks?

In particular, we tackle the challenging task of 3D object detection from a single RGB image. Despite the recent success in 2D scene understanding (e.g., Ren et al. (2015); He et al. (2017)), there is still a significant performance gap for 3D computer vision tasks based on a single 2D image. Recent approaches directly regress the 3D bounding boxes Chen et al. (2016); Mousavian et al. (2017); Huang et al. (2018a) or reconstruct the 3D objects with specific 3D object priors Kundu et al. (2018); Huang et al. (2018b); Yao et al. (2018); He and Soatto (2019). In contrast, we propose an end-to-end trainable framework, PerspectiveNet, that sequentially estimates the 2D bounding box, 2D perspective points, and 3D bounding box for each object under a local Manhattan assumption Xiao and Furukawa (2014), in which the perspective points serve as the intermediate representation, defined as the 2D projections of the local Manhattan 3D keypoints that locate an object.

Figure 1: Traditional 3D object detection methods directly estimate (c) the 3D object bounding boxes from (a) the 2D bounding boxes, which suffer from the uncertainties between the 2D image plane and the 3D world. The proposed PerspectiveNet utilizes (b) the 2D perspective points as the intermediate representation to bridge the gap. The perspective points are the 2D perspective projections of the 3D bounding box corners, containing rich 3D information (e.g., positions, orientations). The red dots indicate the perspective points of the bed that are difficult to localize from visual features alone but can be inferred from the context (correlations and topology) among the other perspective points.

The proposed method offers three unique advantages. First, the use of perspective points as the intermediate representation bridges the gap between 2D and 3D bounding boxes without utilizing any extra category-specific 3D shape priors. As shown in Figure 1, it is often challenging for learning-based methods to estimate the 3D bounding boxes from 2D images directly; regressing 3D bounding boxes from 2D input is a highly under-constrained problem and can be easily influenced by appearance variations of shape, texture, lighting, and background. To alleviate this issue, we adopt the perspective points as an intermediate representation to represent the local Manhattan frame that each 3D object aligns with. Intuitively, the perspective points of an object are 3D geometric constraints in the 2D space. More specifically, the 2D perspective points for each object are defined as the perspective projection of the 3D object bounding box (concatenated with its center), and each 3D box aligns within a 3D local Manhattan frame. These perspective points are fused into the 3D branch to predict the 3D attributes of the 3D bounding boxes.
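To make the definition concrete, here is a minimal NumPy sketch of how the nine perspective points could be computed from a gravity-aligned 3D box and a camera projection matrix; the box parameterization (center, size, yaw), the corner ordering, and the function names are our assumptions, not taken from the paper's code.

```python
import numpy as np

def box_corners_3d(center, size, yaw):
    """8 corners of a gravity-aligned 3D box (z is the gravity axis in this sketch)."""
    l, w, h = size
    x = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * l / 2.0
    y = np.array([1, -1, -1, 1, 1, -1, -1, 1]) * w / 2.0
    z = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * h / 2.0
    corners = np.stack([x, y, z])                                   # (3, 8)
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])      # rotation about gravity
    return R @ corners + np.asarray(center, float).reshape(3, 1)    # (3, 8)

def perspective_points(center, size, yaw, P):
    """Project the box center and its 8 corners with a 3x4 camera matrix P
    (intrinsics composed with extrinsics): the 9 perspective points."""
    pts = np.concatenate([np.asarray(center, float).reshape(3, 1),
                          box_corners_3d(center, size, yaw)], axis=1)  # (3, 9)
    hom = np.vstack([pts, np.ones((1, 9))])                            # homogeneous coords
    uvw = P @ hom
    return (uvw[:2] / uvw[2:]).T                                       # (9, 2) pixels
```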

Second, we devise a template-based method to efficiently and robustly estimate the perspective points. Existing methods Newell et al. (2016); Lee et al. (2017); Zou et al. (2018); He et al. (2017); Suwajanakorn et al. (2018) usually exploit a heatmap or probability distribution map as the representation to learn the locations of visual points (e.g., object keypoints, human skeletons, room layouts); they rely heavily on view-dependent visual features and are thus insufficient to resolve occlusions or large rotation/viewpoint changes in complex scenes. See an example in Figure 1 (b), where the five perspective points (in red) are difficult to localize from pure visual features but could be inferred from the correlations and topology among the other perspective points. To tackle this problem, we treat each set of 2D perspective points as a low-dimensional embedding of its corresponding set of 3D points with a constant topology; such an embedding is learned by predicting the perspective points as a mixture of sparse templates. A perspective loss is formulated to impose the perspective constraints; the details are described in § 3.2.

Third, the consistency between the 2D perspective points and 3D bounding boxes can be maintained by a differentiable projective function; it is end-to-end trainable, from the 2D region proposals, to the 2D bounding boxes, to the 2D perspective points, and to the 3D bounding boxes.

In the experiments, we show that the proposed PerspectiveNet outperforms previous methods by a large margin on the SUN RGB-D dataset Song et al. (2015), demonstrating its efficacy for 3D object detection.


2 Related Work

3D object detection from a single image

Detecting 3D objects from a single RGB image is a challenging problem, particularly due to the intrinsic ambiguity of the problem. Existing methods can be categorized into three streams: (i) geometry-based methods that estimate the 3D bounding boxes with geometry and 3D world priors Zhao and Zhu (2011, 2013); Choi et al. (2013); Lin et al. (2013); Zhang et al. (2014); (ii) learning-based methods that incorporate category-specific 3D shape priors Izadinia et al. (2017); Huang et al. (2018b); He and Soatto (2019) or extra 2.5D information (depth, surface normal, and segmentation) Kundu et al. (2018); Yao et al. (2018); Xu and Chen (2018) to detect 3D bounding boxes or reconstruct the 3D object shape; and (iii) deep learning methods that directly estimate the 3D object bounding boxes from 2D bounding boxes Chen et al. (2015, 2016); Mousavian et al. (2017); Huang et al. (2018a). To make better estimations, various techniques have been devised to enforce consistencies between the estimated 3D and the input 2D image. Huang et al. (2018a) proposed a two-stage method to learn the 3D objects and 3D layout cooperatively. Kundu et al. (2018) proposed a 3D object detection and reconstruction method using category-specific object shape priors by render-and-compare. Different from these methods, the proposed PerspectiveNet is a one-stage, end-to-end trainable 3D object detection framework using perspective points as an intermediate representation; the perspective points naturally bridge the gap between the 2D and 3D bounding boxes without any extra annotations, category-specific 3D shape priors, or 2.5D maps.

Manhattan World assumption

Human-made environments, from the layout of a city to structures such as buildings, rooms, furniture, and many other objects, can be viewed as a set of parallel and orthogonal planes, known as the Manhattan World (MW) assumption Coughlan and Yuille (1999). Formally, it indicates that most human-made structures can be approximated by planar surfaces that are parallel to one of the three principal planes of a common orthogonal coordinate system. This strict Manhattan World assumption was later extended by the Mixture of Manhattan Frames (MMF) Straub et al. (2014) to represent more complex real-world scenes (e.g., city layouts, rotated objects). In the literature, MW and MMF have been adopted in vanishing point (VP) estimation and camera calibration Schindler and Dellaert (2004); Kroeger et al. (2015), orientation estimation Bosse et al. (2003); Straub et al. (2015); Ghanem et al. (2015), layout estimation Hedau et al. (2009); Lee et al. (2009); Hedau et al. (2010); Schwing et al. (2012); Zou et al. (2018), and 3D scene reconstruction Delage et al. (2007); Furukawa et al. (2009); Xiao et al. (2013); Xiao and Furukawa (2014); Ren and Sudderth (2016); Liu et al. (2017). In this work, we extend the MW assumption to a local Manhattan assumption, in which the cuboids are aligned with the vertical (gravity) direction but have arbitrary horizontal orientations (see also Xiao and Furukawa (2014)), and perspective points are adopted as the intermediate representation for 3D object detection.

Intermediate 3D representation

Intermediate 3D representations are bridges that narrow the gap and maintain the consistency between the 2D image plane and the 3D world. Among them, 2.5D sketches have been broadly used in reconstructing 3D shapes Wu et al. (2017); Zhu et al. (2018); Zhang et al. (2018) and 3D scenes Tulsiani et al. (2018); Huang et al. (2018b). Other recent alternative intermediate 3D representations include: (i) Wu et al. (2016) uses pre-annotated and category-specific object keypoints as an intermediate representation, and (ii) Tekin et al. (2018) uses the projected corners of 3D bounding boxes in learning the 6D object pose. In this paper, we explore perspective points as an intermediate representation between 2D and 3D bounding boxes and provide an efficient learning framework for 3D object detection.

3 Learning Perspective Points for 3D Object Detection

3.1 Overall Architecture

As shown in Figure 2, the proposed PerspectiveNet contains a backbone architecture for feature extraction over the entire image, a region proposal network (RPN) Ren et al. (2015) that proposes regions of interest (RoIs), and a network head including three region-wise parallel branches. For each proposed box, its RoI feature is fed into the three network branches to predict: (i) the object class and the 2D bounding box offset, (ii) the 2D perspective points (projected 3D box corners and object center) as a weighted sum of predicted perspective templates, and (iii) the 3D box size, orientation, and its distance from the camera. Detected 3D boxes are reconstructed from the projected object center, distance, box size, and rotation. The overall architecture of PerspectiveNet resembles the R-CNN structure, and we refer readers to Ren et al. (2015); Girshick (2015); He et al. (2017) for more details on training R-CNN detectors.

Figure 2: The proposed framework of PerspectiveNet. Given an RGB image, the backbone of PerspectiveNet extracts global features and proposes candidate 2D bounding boxes (RoIs). For each proposed box, its RoI feature is fed into three network branches to predict: (i) the object class and the 2D box offset, (ii) 2D perspective templates (projected 3D box corners and object center) and the corresponding coefficients, and (iii) the 3D box size, orientation, and its distance from the camera. Detected 3D boxes are reconstructed from the projected object center, distance, box size, and rotation. By projecting the detected 3D boxes to 2D and comparing them with the 2D perspective points, the network imposes and learns a consistency between the 2D inputs and 3D estimations.

During training, we define a multi-task loss on each proposed RoI as

$\mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{2D} + \mathcal{L}_{ppt} + \mathcal{L}_{persp} + \mathcal{L}_{3D} + \mathcal{L}_{proj}$,   (1)

where the classification loss $\mathcal{L}_{cls}$ and the 2D bounding box loss $\mathcal{L}_{2D}$ belong to the 2D bounding box branch and are identical to those defined in 2D R-CNNs Ren et al. (2015); He et al. (2017). $\mathcal{L}_{ppt}$ and $\mathcal{L}_{persp}$ are defined on the perspective point branch (§ 3.2), $\mathcal{L}_{3D}$ is defined on the 3D bounding box branch (see § 3.3), and $\mathcal{L}_{proj}$ is defined for maintaining the 2D-3D projection consistency (see § 3.4).


3.2 Perspective Point Estimation

The perspective point branch estimates the set of 2D perspective points for each RoI. Formally, the 2D perspective points of an object are the 2D projections of local Manhattan 3D keypoints to locate that object, and they satisfy certain geometric constraints imposed by the perspective projection. In our case, the perspective points (Figure 1(b)) include the 2D projections of the 3D bounding box corners and the 3D object center. The perspective points are predicted using a template-based regression and learned by a mean squared error and a perspective loss detailed below.

3.2.1 Template-based Regression

Most of the existing methods Newell et al. (2016); Lee et al. (2017); Zou et al. (2018); He et al. (2017); Suwajanakorn et al. (2018) estimate visual keypoints with heatmaps, where each map predicts the location of a certain keypoint. However, predicting perspective points by heatmaps has two major problems: (i) Heatmap prediction for different keypoints is independent, thus failing to capture the topological correlations among the perspective points. (ii) Heatmap prediction for each keypoint relies heavily on visual features such as corners, which may be difficult to detect (see an example in Figure 1(b)). In contrast, each set of 2D perspective points can be treated as a low-dimensional embedding of a set of 3D points with a particular topology; inferring such points therefore relies more on the relations and topology among the points than on the visual features alone.

To tackle these problems, we avoid dense per-pixel predictions. Instead, we estimate the perspective points by a mixture of sparse templates Olshausen and Field (1996); Wu et al. (2010). The sparse templates are more robust when facing unfamiliar scenes or objects. Ablative experiments show that the proposed template-based method provides a more accurate estimation of perspective points than heatmap-based methods; see § 5.1.

Specifically, we project both the 3D object center and the eight 3D bounding box corners to 2D with the camera parameters to generate the ground-truth 2D perspective points $p^{gt}$. Since a portion of the perspective points usually lies outside the RoI, we compute the locations of the perspective points in an extended (doubled) RoI and normalize the locations to $[0, 1]$.
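A small sketch of this normalization step, assuming the doubled RoI is centered on the original RoI; the clipping of points that fall outside the extended RoI is our assumption.

```python
import numpy as np

def normalize_in_extended_roi(points_px, roi, scale=2.0):
    """Normalize perspective points (9, 2) in pixels to [0, 1] inside an RoI
    extended by `scale` around its center (the doubled-RoI convention).
    `roi` is (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = roi
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    top_left = np.array([cx - w / 2.0, cy - h / 2.0])
    norm = (np.asarray(points_px, float) - top_left) / np.array([w, h])
    return np.clip(norm, 0.0, 1.0)   # points far outside the extended RoI are clipped
```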

We predict the perspective points by a linear combination of templates; see Figure 3. The perspective point branch outputs an $(N_t \times C \times 18)$-dimensional vector for the templates $T$ and an $(N_t \times C)$-dimensional vector for the coefficients $\alpha$, where $N_t$ denotes the number of templates for each class and $C$ denotes the number of object classes. The templates $T$ are scaled to $[0, 1]$ by a sigmoid nonlinearity, and the coefficients $\alpha$ are normalized by a softmax function. For each class, the estimated perspective points $\hat{p}$ are computed by a linear combination of its templates:

$\hat{p} = \sum_{k=1}^{N_t} \alpha_k T_k$.   (2)
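A PyTorch sketch of this mixture; the tensor layout (templates before classes, nine points of two coordinates each) and the helper name are our assumptions.

```python
import torch
import torch.nn.functional as F

def combine_templates(template_logits, coeff_logits, cls_idx, n_templates, n_classes):
    """Eq. (2): predict perspective points as a coefficient-weighted sum of templates.

    template_logits: (B, n_templates * n_classes * 18) raw FC output
    coeff_logits:    (B, n_templates * n_classes)      raw FC output
    cls_idx:         (B,) class index used to select the class-specific output
    """
    B = template_logits.shape[0]
    T = torch.sigmoid(template_logits).view(B, n_templates, n_classes, 9, 2)  # templates in [0, 1]
    a = F.softmax(coeff_logits.view(B, n_templates, n_classes), dim=1)        # coefficients sum to 1
    p_hat = (a.unsqueeze(-1).unsqueeze(-1) * T).sum(dim=1)                    # (B, n_classes, 9, 2)
    # Select the perspective points of the class of interest (the ground-truth class
    # at training time, the classification branch's prediction at inference).
    return p_hat[torch.arange(B), cls_idx]                                    # (B, 9, 2)
```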


Figure 3: Perspective point estimation. (a) The perspective points are estimated by a mixture of templates through a linear combination. Each template encodes geometric cues including orientations and viewpoints. (b) The perspective loss enforces each set of 2D perspective points to be the perspective projection of a (vertical) 3D cuboid. For a vertical cuboid, the projected vertical edges should be parallel or near parallel (under small camera tilt angles). For 3D parallel lines that are perpendicular to the gravity direction, the vanishing points of their 2D projections should coincide.

The template design is both class-specific and instance-specific: (i) Class-specific: we decouple the prediction of the perspective point and the object class, allowing the network to learn perspective points for every class without competition among classes. (ii) Instance-specific: the templates are inferred for each RoI; hence, they are specific to each object instance. The templates are automatically learned for each object instance from data with the end-to-end learning framework; thus, both the templates and coefficients for each instance are optimizable and can better fit the training data.

The mean squared error (MSE) loss, averaged over the coordinates, is defined as $\mathcal{L}_{ppt} = \frac{1}{18} \lVert \hat{p} - p^{gt} \rVert_2^2$. For an RoI associated with ground-truth class $c$, $\mathcal{L}_{ppt}$ is only defined on class $c$'s perspective points during training; perspective point outputs from the other classes do not contribute to the loss. At inference time, we rely on the dedicated classification branch to predict the class label used to select the output perspective points.

3.2.2 Perspective Loss

Under the assumption that each 3D bounding box aligns with a local Manhattan frame, we regularize the estimated perspective points to satisfy the constraints of perspective projection. Each set of mutually parallel lines in 3D is projected into 2D as lines that should converge at the same vanishing point; see Figure 3 (b). The desired algorithm therefore penalizes the distance between the intersection points computed from different pairs of these projected lines: for example, in Figure 3 (b), we select two pairs of projected lines that are parallel in 3D and compute the distance between their two intersection points. Additionally, since we assume each 3D local Manhattan frame aligns with the vertical (gravity) direction, we enforce the projected edges along the gravity direction to be parallel by penalizing a large slope variance.

The perspective loss is computed as $\mathcal{L}_{persp} = \mathcal{L}_{v} + \mathcal{L}_{h_1} + \mathcal{L}_{h_2}$, where $\mathcal{L}_{v}$ penalizes the slope variance of the projected edges along the gravity direction, and $\mathcal{L}_{h_1}$ and $\mathcal{L}_{h_2}$ penalize the intersection-point distances for the two horizontal directions perpendicular to the gravity direction.
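The following PyTorch sketch illustrates one way such a loss could be computed; the corner ordering (top face then bottom face) and the particular edge pairs used for the vanishing points are our assumptions, and the actual implementation may differ.

```python
import torch

def _line(p, q):
    """Homogeneous line through 2D points p, q (both (B, 2))."""
    ph = torch.cat([p, torch.ones_like(p[:, :1])], dim=1)
    qh = torch.cat([q, torch.ones_like(q[:, :1])], dim=1)
    return torch.cross(ph, qh, dim=1)                        # (B, 3)

def _meet(l1, l2, eps=1e-6):
    """2D intersection point of two homogeneous lines."""
    x = torch.cross(l1, l2, dim=1)
    w = x[:, 2:]
    w = torch.where(w.abs() < eps, torch.full_like(w, eps), w)
    return x[:, :2] / w

def perspective_loss(p):
    """Sketch of L_persp on predicted perspective points p: (B, 9, 2), assumed to be
    ordered as [center, top corners 1-4, bottom corners 5-8]."""
    c = p[:, 1:]                                              # the 8 projected corners
    top, bot = c[:, :4], c[:, 4:]
    # (1) Vertical edges should stay (near) parallel: penalize slope variance.
    d = bot - top                                             # (B, 4, 2) vertical edge vectors
    slope = d[..., 0] / (d[..., 1].abs() + 1e-6)              # run over rise (gravity ~ image y)
    l_v = slope.var(dim=1).mean()
    # (2) Projections of 3D-parallel horizontal edges should meet at one vanishing point:
    #     penalize the distance between the top-face and bottom-face intersections.
    vp1_t = _meet(_line(top[:, 0], top[:, 1]), _line(top[:, 3], top[:, 2]))
    vp1_b = _meet(_line(bot[:, 0], bot[:, 1]), _line(bot[:, 3], bot[:, 2]))
    vp2_t = _meet(_line(top[:, 1], top[:, 2]), _line(top[:, 0], top[:, 3]))
    vp2_b = _meet(_line(bot[:, 1], bot[:, 2]), _line(bot[:, 0], bot[:, 3]))
    l_h1 = (vp1_t - vp1_b).norm(dim=1).mean()
    l_h2 = (vp2_t - vp2_b).norm(dim=1).mean()
    return l_v + l_h1 + l_h2
```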

3.3 3D Bounding Box Estimation

Estimating 3D bounding boxes is a two-step process. In the first step, the 3D branch estimates the 3D attributes, including the distance between the camera center and the 3D object center, as well as the 3D size and orientation, following Huang et al. (2018a). Since the perspective point branch encodes rich 3D geometric features, the 3D attribute estimator aggregates features from the perspective point branch with a soft gate function whose output lies between 0 and 1 to improve the prediction. The gate function serves as a soft-attention mechanism that decides how much information from the perspective points should contribute to the 3D prediction.

In the second step, with the estimated projected 3D bounding box center (i.e., the first estimated perspective point) and the 3D attributes, we compose the 3D bounding box by the inverse projection from the 2D image plane to the 3D world, following Huang et al. (2018a), given the camera parameters.
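A NumPy sketch of the inverse projection for the object center, assuming a pinhole camera with intrinsics K; the remaining corners would then be composed from the estimated size and orientation (as in the earlier box sketch) and, if needed, transformed with the camera extrinsics.

```python
import numpy as np

def unproject_center(uv, distance, K):
    """Recover the 3D object center in the camera frame from its projected 2D
    location `uv` (pixels) and its estimated distance to the camera center,
    given intrinsics K (3x3). A sketch of the inverse projection step."""
    ray = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])   # viewing ray direction
    ray = ray / np.linalg.norm(ray)                          # unit-length ray
    return distance * ray                                    # 3D center along the ray
```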

The 3D loss $\mathcal{L}_{3D}$ is computed as the sum of the individual losses on the 3D attributes (distance, size, and orientation) and a joint loss on the composed 3D bounding box.


3.4 2D-3D Consistency

In contrast to prior work Wu et al. (2016); Rezende et al. (2016); Yan et al. (2016); Mousavian et al. (2017); Wu et al. (2017); Huang et al. (2018a) that enforces the consistency between estimated 3D objects and the 2D image, we devise a new way to impose a re-projection consistency loss between the 3D bounding boxes and the perspective points. Specifically, we project the estimated 3D bounding box corners back to the 2D image plane and compute the distance between these re-projected points and the ground-truth perspective points $p^{gt}$. Compared with prior work that maintains the consistency between 2D and 3D bounding boxes by approximating the 2D projection of the 3D bounding boxes Mousavian et al. (2017); Huang et al. (2018a), the proposed method uses the exact projection of the estimated 3D boxes to establish the consistency, capturing a more precise 2D-3D relationship.
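A sketch of such a re-projection consistency term; the smooth L1 penalty and the batched projection-matrix layout are our assumptions.

```python
import torch

def projection_consistency_loss(corners_3d, p_gt, P):
    """Sketch of L_proj: re-project the estimated 3D box corners and compare
    them with the ground-truth 2D perspective points.

    corners_3d: (B, 8, 3) estimated box corners (in the frame matching P)
    p_gt:       (B, 8, 2) ground-truth projected corners, in pixels
    P:          (B, 3, 4) camera projection matrices
    """
    ones = torch.ones_like(corners_3d[..., :1])
    hom = torch.cat([corners_3d, ones], dim=-1)              # homogeneous 3D points
    uvw = torch.einsum('bij,bkj->bki', P, hom)               # (B, 8, 3)
    uv = uvw[..., :2] / uvw[..., 2:].clamp(min=1e-6)         # perspective divide (positive depth assumed)
    return torch.nn.functional.smooth_l1_loss(uv, p_gt)      # distance to ground truth
```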

Figure 4: Qualitative results (top 50%). For every three columns as a group: (Left) The RGB image with 2D detection results. (Middle) The RGB image with estimated perspective points. (Right) The results in 3D point cloud; point cloud is used for visualization only.

4 Implementation Details

Network Backbone

Inspired by He et al. (2017), we use a combination of a residual network (ResNet) He et al. (2016) and a feature pyramid network (FPN) Lin et al. (2017) to extract features from the entire image. A region proposal network (RPN) Ren et al. (2015) is used to produce object proposals (i.e., RoIs). An RoIAlign He et al. (2017) module is adopted to extract a small fixed-size feature map for each proposal.
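For illustration, a minimal sketch of the per-RoI feature extraction path, using a single ResNet feature level in place of the full FPN; the 7x7 RoIAlign output size is an assumption, and a recent torchvision is assumed.

```python
import torch
import torchvision

# Single-level stand-in for the ResNet+FPN backbone, to keep the example short.
backbone = torchvision.models.resnet50(weights=None)
stem = torch.nn.Sequential(*list(backbone.children())[:-2])   # conv features, stride 32

image = torch.randn(1, 3, 800, 1066)                 # shorter edge resized to 800 (Sec. 4)
feat = stem(image)                                   # (1, 2048, H/32, W/32)

rois = [torch.tensor([[32.0, 48.0, 256.0, 300.0]])]  # one proposal per image (x1, y1, x2, y2)
roi_feat = torchvision.ops.roi_align(feat, rois, output_size=(7, 7),
                                     spatial_scale=1.0 / 32)   # fixed-size per-RoI feature
print(roi_feat.shape)                                # torch.Size([1, 2048, 7, 7])
```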

Network Head

The network head consists of three branches, and each branch has its own feature extractor and predictor. The three feature extractors share the same architecture of two fully connected (FC) layers, each followed by a ReLU; they take the flattened RoI features as input and output a 1024-dimensional vector.

The predictor in the 2D branch has two separate FC layers to predict the object class probabilities and the 2D bounding box offsets. The predictor in the perspective point branch predicts the $(N_t \times C \times 18)$-dimensional templates and the $(N_t \times C)$-dimensional coefficients with two FC layers and their corresponding nonlinear activation functions (i.e., sigmoid and softmax). The soft gate in the 3D branch consists of an FC layer (1024-1) and a sigmoid function to generate the weight for feature aggregation. The predictor in the 3D branch consists of three FC layers to predict the size, the distance from the camera, and the orientation of the 3D bounding box.
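Putting the three branches together, a skeleton of the head could look as follows; the flattened input size, the (C+1)-way classifier with 4C box deltas, and the per-attribute output dimensions are standard R-CNN-style assumptions rather than details taken from the paper.

```python
import torch.nn as nn

class PerspectiveHead(nn.Module):
    """Skeleton of the three-branch network head (1024-d features follow Sec. 4)."""
    def __init__(self, in_dim, n_classes, n_templates):
        super().__init__()
        def extractor():
            return nn.Sequential(nn.Linear(in_dim, 1024), nn.ReLU(),
                                 nn.Linear(1024, 1024), nn.ReLU())
        self.feat_2d, self.feat_ppt, self.feat_3d = extractor(), extractor(), extractor()
        # 2D branch: class scores (plus background) and box offsets.
        self.cls_score = nn.Linear(1024, n_classes + 1)
        self.box_delta = nn.Linear(1024, 4 * n_classes)
        # Perspective point branch: templates and mixture coefficients.
        self.templates = nn.Linear(1024, n_templates * n_classes * 18)
        self.coeffs = nn.Linear(1024, n_templates * n_classes)
        # 3D branch: soft gate plus size / distance / orientation predictors.
        self.gate = nn.Sequential(nn.Linear(1024, 1), nn.Sigmoid())
        self.size = nn.Linear(1024, 3)
        self.distance = nn.Linear(1024, 1)
        self.orientation = nn.Linear(1024, 1)

    def forward(self, roi_feat):
        x = roi_feat.flatten(1)
        f2d, fppt, f3d = self.feat_2d(x), self.feat_ppt(x), self.feat_3d(x)
        f3d = f3d + self.gate(fppt) * fppt            # gated aggregation (Sec. 3.3)
        return {
            "cls": self.cls_score(f2d), "box2d": self.box_delta(f2d),
            "templates": self.templates(fppt), "coeffs": self.coeffs(fppt),
            "size": self.size(f3d), "distance": self.distance(f3d),
            "orientation": self.orientation(f3d),
        }
```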


Figure 5: Precision-Recall (PR) curves for 3D object detection on SUN RGB-D

5 Experiments

Dataset  We conduct comprehensive experiments on the SUN RGB-D dataset Song et al. (2015). The SUN RGB-D dataset contains a total of 10,335 images, of which 5,050 are test images. It has rich annotations of scene categories, camera poses, and 3D bounding boxes. We evaluate the 3D object detection results of the proposed PerspectiveNet, compare with the state-of-the-art methods, and further examine the contribution of each module in ablative experiments.

Experimental Setup  To prepare valid data for training the proposed model, we discard the images with no 3D objects or incorrect correspondences between 2D and 3D bounding boxes, resulting in 4,783 training images and 4,220 test images. We detect 30 categories of objects following Huang et al. (2018a).

Reproducibility Details  During training, an RoI is considered positive if its IoU with a ground-truth box is at least 0.5. $\mathcal{L}_{ppt}$, $\mathcal{L}_{persp}$, $\mathcal{L}_{3D}$, and $\mathcal{L}_{proj}$ are only defined on positive RoIs. Each image has N sampled RoIs, where the ratio of positive to negative is 1:3, following the protocol presented in Girshick (2015).

We resize the images so that the shorter edges are all 800 pixels. To avoid over-fitting, a data augmentation procedure is performed by randomly flipping the images or randomly shifting the 2D bounding boxes with corresponding labels during the training. We use SGD for optimization with a batch size of 32 on a desktop with 4 Nvidia TITAN RTX cards (8 images each card). The learning rate starts at 0.01 and decays by 0.1 at 30,000 and 35,000 iterations. We implement our framework based on the code of Massa and Girshick (2018). It takes 6 hours to train, and the trained PerspectiveNet provides inference in real-time (20 FPS) using a single GPU.

Since the consistency loss and perspective loss can be substantial during the early stage of training, we add them to the joint loss only after the learning rate has decayed twice. The hyper-parameters (e.g., the loss weights and the architecture of the network head) are tuned empirically by a local search.
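A sketch of this schedule; the momentum and weight-decay values are assumptions, and the per-branch losses are assumed to be returned as a dictionary keyed as in Eq. (1).

```python
import torch

def train(model, data_loader, max_iter=40000):
    """Sketch of the schedule in Sec. 4: SGD, lr 0.01 decayed by 0.1 at 30k/35k
    iterations, with the perspective and consistency losses added only after the
    second decay."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[30000, 35000], gamma=0.1)
    for it, batch in enumerate(data_loader):
        if it >= max_iter:
            break
        losses = model(batch)                    # dict of per-branch losses, Eq. (1)
        loss = losses["cls"] + losses["box2d"] + losses["ppt"] + losses["3d"]
        if it >= 35000:                          # after the learning rate has decayed twice
            loss = loss + losses["persp"] + losses["proj"]
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
```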

Evaluation Metric  We evaluate the performance of 3D object detection using the metric presented in Song et al. (2015). Specifically, we first calculate the 3D Intersection over Union (IoU) between the predicted 3D bounding boxes and the ground-truth 3D bounding boxes, and then compute the mean average precision (mAP). Following Huang et al. (2018a), we set the 3D IoU threshold as 0.15 in the absence of depth information.
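A simplified sketch of this metric for gravity-aligned boxes, computing the 3D IoU as the bird's-eye-view polygon overlap times the vertical overlap (the official evaluation code may differ); shapely is used for the footprint intersection.

```python
import numpy as np
from shapely.geometry import Polygon

def box_bev_polygon(center, size, yaw):
    """Bird's-eye-view footprint of a gravity-aligned 3D box (center_xyz, size_lwh, yaw)."""
    l, w, _ = size
    pts = np.array([[l, w], [l, -w], [-l, -w], [-l, w]]) / 2.0
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s], [s, c]])
    return Polygon(pts @ R.T + np.asarray(center[:2]))

def iou_3d(b1, b2):
    """Approximate 3D IoU for two gravity-aligned boxes b = (center, size, yaw)."""
    inter_bev = box_bev_polygon(*b1).intersection(box_bev_polygon(*b2)).area
    z1_lo, z1_hi = b1[0][2] - b1[1][2] / 2, b1[0][2] + b1[1][2] / 2
    z2_lo, z2_hi = b2[0][2] - b2[1][2] / 2, b2[0][2] + b2[1][2] / 2
    inter_h = max(0.0, min(z1_hi, z2_hi) - max(z1_lo, z2_lo))
    inter = inter_bev * inter_h
    vol1, vol2 = np.prod(b1[1]), np.prod(b2[1])
    return inter / (vol1 + vol2 - inter + 1e-9)
```

Detections are then ranked by confidence, matched to the ground truth at the 0.15 IoU threshold, and the per-category average precisions are averaged into the mAP.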

Qualitative Results  The qualitative results of 2D object detection, 2D perspective point estimation, and 3D object detection are shown in Figure 4. Note that the proposed method performs accurate 3D object detection in some challenging scenes. For the perspective point estimation, even though some of the perspective points are not aligned with image features, the proposed method can still localize their positions robustly.

Quantitative Results  Since the state-of-the-art method Huang et al. (2018a) learns the camera extrinsic parameters jointly, we provide two protocols for evaluation to ensure a fair comparison: (i) PerspectiveNet given the ground-truth camera extrinsic parameters (full), and (ii) PerspectiveNet without the ground-truth camera extrinsic parameters, learning them jointly following Huang et al. (2018a) (w/o. cam).

We learn the detector for 30 object categories and report the precision-recall (PR) curve of 10 main categories in Figure 5. We calculate the area under the curve to compute AP; Table 1 shows the comparisons of APs of the proposed models with existing approaches (see supplementary materials for the APs of all 30 categories).


Note that the critical difference between the proposed model and the state-of-the-art method Huang et al. (2018a) is the intermediate representation used to learn the 2D-3D consistency. Huang et al. (2018a) uses 2D bounding boxes to enforce a 2D-3D consistency by minimizing the differences between projected 3D boxes and detected 2D boxes. In contrast, the proposed intermediate representation has a clear advantage since projected 3D boxes are often not 2D rectangles, and perspective points eliminate such errors.

Quantitatively, our full model improves the mAP of the state-of-the-art method Huang et al. (2018a) by 14.71%, and the model without the camera extrinsic parameters improves it by 10.91%. The significant improvement of the mAP demonstrates the efficacy of the proposed intermediate representation. We defer the analysis of how each component contributes to the overall performance to § 5.1.

 

bed chair sofa table desk toilet bin sink shelf lamp mAP
3DGP Choi et al. (2013) 5.62 2.31 3.24 1.23 - - - - - - -
HoPR Huang et al. (2018b) 58.29 13.56 28.37 12.12 4.79 16.50 0.63 2.18 1.29 2.41 14.01
CooP Huang et al. (2018a) 63.58 17.12 41.22 26.21 9.55 58.55 10.19 5.34 3.01 1.75 23.65
Ours (w/o. cam) 71.39 34.94 55.63 34.10 14.23 73.73 17.47 34.41 4.21 9.54 34.96
Ours (full) 79.69 40.42 62.35 44.12 20.19 81.22 22.42 41.35 8.29 13.14 39.09

 

Table 1: Comparisons of 3D object detection on SUN RGB-D (AP).

5.1 Ablative Analysis

Figure 6: Heatmaps vs. templates for perspective point prediction. (Left) Estimated by heatmap-based method. (Right) Estimated by the proposed template-based method.

In this section, we analyze each major component of the model to examine its contribution to the overall performance gain. Specifically, we design six variants of the proposed model:
(a) The model trained without the perspective point branch, using the 2D offset to predict the 3D center of the object following Huang et al. (2018a).
(b) The model that aggregates the features from the perspective point branch and the 3D branch directly, without the gate function.
(c) The model that aggregates the features from the perspective point branch and the 3D branch with a gate function that only outputs 0 or 1 (hard gate).
(d) The model trained without the perspective loss.
(e) The model trained without the consistency loss.
(f) The model trained without the perspective branch, perspective loss, or consistency loss.
Table 2 shows the mAP for each variant of the proposed model. The mAP drops by 3.86% without the perspective point branch (a) and by 1.66% without the consistency loss (e), indicating that the perspective points and the re-projection consistency influence the proposed framework the most. In addition, the choice of gate function ((b), (c)) and the perspective loss (d) contribute less to the final performance. Since the simplest variant (f) is still 9.32% higher than the state-of-the-art result Huang et al. (2018a), we conjecture this performance gain may come from the one-stage (vs. two-stage) end-to-end training framework and the usage of the ground-truth camera parameters; we will further investigate this in future work.

 

Setting (a) (b) (c) (d) (e) (f) Full
mAP 35.23 38.63 38.87 39.01 37.43 32.97 39.09

 

Table 2: Ablative analysis of the proposed model on SUN RGB-D. We evaluate the mAP for 3D object detection; settings (a)-(f) correspond to the variants listed in § 5.1.

5.2 Heatmaps vs. Templates

As discussed in § 3.2, we test two different methods for perspective point estimation: (i) dense prediction as heatmaps, following the human pose estimation mechanism in He et al. (2017), by adding a parallel heatmap prediction branch, and (ii) the proposed template-based regression. The qualitative results (see Figure 6) show that the heatmap-based estimation suffers severely from occlusion and topology changes among the perspective points, whereas the proposed template-based regression eases the problem significantly by learning robust sparse templates that capture consistent topological relations. We also evaluate the quantitative results by computing the average absolute distance between the ground-truth and estimated perspective points. The heatmap-based method has a 10.25-pixel error, while the proposed method has only a 6.37-pixel error, which further demonstrates the efficacy of the proposed template-based perspective point estimation.


5.3 Failure Cases

In a large portion of the failure cases, the perspective point estimation and the 3D box estimation fail at the same time; see Figure 7. This implies that the perspective point estimation and the 3D box estimation are highly coupled, which supports the assumption that the perspective points encode rich 3D information and that the 3D branch learns meaningful knowledge from the 2D branch. In future work, we may need a more sophisticated and general 3D prior to infer the 3D locations of objects in such challenging cases.

Figure 7: Some failure cases. The perspective point estimation and the 3D box estimation fail at the same time.

5.4 Discussions and Future Work

Comparison with optimization-based methods.

Assuming the estimated 3D size or distance is given, it is possible to compute the 3D bounding box with an optimization-based method such as efficient PnP. However, optimization-based methods are sensitive to the accuracy of the given variables. They are better suited to tasks with smaller solution spaces (e.g., 6-DoF pose estimation, where the 3D shapes of objects are fixed) and struggle in tasks with larger solution spaces (e.g., 3D object detection, where the 3D size, distance, and object pose can vary significantly). Therefore, we argue that directly estimating each variable, with constraints imposed among them, is a more natural and straightforward solution.

Potential incorporation with depth information.

PerspectiveNet estimates the distance between the 3D object center and the camera center from the color image alone (pure RGB without any depth information). If depth information were also provided, the proposed method should be able to make a much more accurate distance prediction.

Potential application to outdoor environment.

It would be interesting to see how the proposed method performs on outdoor 3D object detection datasets such as KITTI Geiger et al. (2013). The differences between indoor and outdoor datasets for 3D object detection lie in various aspects, including the diversity of object categories, the variety of object dimensions, the severity of occlusion, the range of camera angles, and the range of distances (depth). We hope to adapt PerspectiveNet to outdoor scenarios in future work.

6 Conclusion

We propose PerspectiveNet, an end-to-end differentiable framework for 3D object detection from a single RGB image. It uses perspective points as an intermediate representation between the 2D input and the 3D estimations. PerspectiveNet adopts an R-CNN structure, where region-wise branches predict the 2D boxes, perspective points, and 3D boxes. Instead of directly regressing the 2D-3D relations, we further propose a template-based regression for estimating the perspective points, which enforces a better consistency between the predicted 3D boxes and the 2D image input. The experiments show that the proposed method significantly outperforms existing RGB-based methods.

Acknowledgments

This work reported herein is supported by MURI ONR N00014-16-1-2007, DARPA XAI N66001-17-2-4029, ONR N00014-19-1-2153, and an NVIDIA GPU donation grant.


References

  • H. G. Barrow and J. M. Tenenbaum (1981) Interpreting line drawings as three-dimensional surfaces. Artificial Intelligence 17 (1-3), pp. 75–116. Cited by: §1.
  • I. Binford (1971) Visual perception by computer. In IEEE Conference of Systems and Control, Cited by: §1.
  • M. Bosse, R. Rikoski, J. Leonard, and S. Teller (2003) Vanishing points and three-dimensional lines from omni-directional video. The Visual Computer. Cited by: §2.
  • D. Broadbent (1985) A question of levels: comment on mcclelland and rumelhart.. American Psychological Association. Cited by: §1.
  • R. A. Brooks (1981) Symbolic reasoning among 3-d models and 2-d images. Artificial Intelligence 17 (1-3), pp. 285–348. Cited by: §1.
  • X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun (2016) Monocular 3d object detection for autonomous driving. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
  • X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun (2015) 3d object proposals for accurate object class detection. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • W. Choi, Y. Chao, C. Pantofaru, and S. Savarese (2013) Understanding indoor scenes using 3d geometric phrases. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, Table 1.
  • J. M. Coughlan and A. L. Yuille (1999) Manhattan world: compass direction from a single image by bayesian inference. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
  • J. M. Coughlan and A. L. Yuille (2003) Manhattan world: orientation and outlier detection by bayesian inference. Neural Computation. Cited by: §1.
  • E. Delage, H. Lee, and A. Y. Ng (2007) Automatic single-image 3d reconstructions of indoor manhattan world scenes. In Robotics Research, pp. 305–321. Cited by: §2.
  • Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski (2009) Manhattan-world stereo. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. International Journal of Robotics Research (IJRR) 32 (11), pp. 1231–1237. Cited by: §5.4.
  • B. Ghanem, A. Thabet, J. Carlos Niebles, and F. Caba Heilbron (2015) Robust manhattan frame estimation from a single rgb-d image. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • R. Girshick (2015) Fast r-cnn. In International Conference on Computer Vision (ICCV), Cited by: §3.1, §5.
  • C. Guo, S. Zhu, and Y. N. Wu (2003) Towards a mathematical theory of primal sketch and sketchability. In International Conference on Computer Vision (ICCV), Cited by: §1.
  • C. Guo, S. Zhu, and Y. N. Wu (2007) Primal sketch: integrating structure and texture. Computer Vision and Image Understanding (CVIU) 106 (1), pp. 5–19. Cited by: §1.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In International Conference on Computer Vision (ICCV), Cited by: §1, §1, §3.1, §3.1, §3.2.1, §4, §5.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.
  • T. He and S. Soatto (2019) Mono3D++: monocular 3d vehicle detection with two-scale 3d hypotheses and task priors. arXiv preprint arXiv:1901.03446. Cited by: §1, §2.
  • V. Hedau, D. Hoiem, and D. Forsyth (2009) Recovering the spatial layout of cluttered rooms. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • V. Hedau, D. Hoiem, and D. Forsyth (2010) Thinking inside the box: using appearance models and context based on room geometry. In European Conference on Computer Vision (ECCV), Cited by: §2.
  • S. Huang, S. Qi, Y. Xiao, Y. Zhu, Y. N. Wu, and S. Zhu (2018a) Cooperative holistic scene understanding: unifying 3d object, layout, and camera pose estimation. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §2, §3.3, §3.3, §3.4, §5.1, Table 1, §5, §5, §5, §5, §5.
  • S. Huang, S. Qi, Y. Zhu, Y. Xiao, Y. Xu, and S. Zhu (2018b) Holistic 3d scene parsing and reconstruction from a single rgb image. In European Conference on Computer Vision (ECCV), Cited by: §1, §2, §2, Table 1.
  • H. Izadinia, Q. Shan, and S. M. Seitz (2017) IM2CAD. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • B. Julesz (1962) Visual pattern discrimination. IRE transactions on Information Theory 8 (2), pp. 84–92. Cited by: §1.
  • B. Julesz (1981) Textons, the elements of texture perception, and their interactions. Nature 290 (5802), pp. 91. Cited by: §1.
  • T. Kanade (1981) Recovery of the three-dimensional shape of an object from a single view. Artificial intelligence 17 (1-3), pp. 409–460. Cited by: §1.
  • K. Koffka (2013) Principles of gestalt psychology. Routledge. Cited by: §1.
  • W. Köhler (1920) Die physischen gestalten in ruhe und im stationärenzustand. eine natur-philosophische untersuchung [the physical gestalten at rest and in steady state]. Braunschweig, Germany: Vieweg und Sohn.. Cited by: §1.
  • W. Köhler (1938) Physical gestalten. In A source book of Gestalt psychology, pp. 17–54. Cited by: §1.
  • T. Kroeger, D. Dai, and L. Van Gool (2015) Joint vanishing point extraction and tracking. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • A. Kundu, Y. Li, and J. M. Rehg (2018) 3D-rcnn: instance-level 3d object reconstruction via render-and-compare. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
  • C. Lee, V. Badrinarayanan, T. Malisiewicz, and A. Rabinovich (2017) RoomNet: end-to-end room layout estimation. In International Conference on Computer Vision (ICCV), Cited by: §1, §3.2.1.
  • D. C. Lee, M. Hebert, and T. Kanade (2009) Geometric reasoning for single image structure recovery. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • D. Lin, S. Fidler, and R. Urtasun (2013) Holistic scene understanding for 3d object detection with rgbd cameras. In International Conference on Computer Vision (ICCV), Cited by: §2.
  • T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.
  • X. Liu, Y. Zhao, and S. Zhu (2017) Single-view 3d scene reconstruction and parsing by attribute grammar. Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 40 (3), pp. 710–725. Cited by: §2.
  • D. G. Lowe (1987) Three-dimensional object recognition from single two-dimensional images. Artificial Intelligence 31 (3), pp. 355–395. Cited by: §1.
  • D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International journal of computer vision 60 (2), pp. 91–110. Cited by: §1.
  • D. Lowe (2012) Perceptual organization and visual recognition. Vol. 5, Springer Science & Business Media. Cited by: §1.
  • D. Marr and H. K. Nishihara (1978) Representation and recognition of the spatial organization of three-dimensional shapes. Proceedings of the Royal Society of London. Series B. Biological Sciences 200 (1140), pp. 269–294. Cited by: §1.
  • D. Marr (1982) Vision: a computational investigation into the human representation and processing of visual information. WH Freeman. Cited by: §1, §1.
  • F. Massa and R. Girshick (2018) maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. Note: https://github.com/facebookresearch/maskrcnn-benchmark Cited by: §5.
  • A. Mousavian, D. Anguelov, J. Flynn, and J. Košecká (2017) 3d bounding box estimation using deep learning and geometry. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §3.4.
  • A. Newell, K. Yang, and J. Deng (2016) Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision (ECCV), Cited by: §1, §3.2.1.
  • M. Nitzberg and D. Mumford (1990) The 2.1-d sketch. In ICCV, Cited by: §1.
  • B. A. Olshausen and D. J. Field (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381 (6583), pp. 607. Cited by: §3.2.1.
  • A. P. Pentland (1987) Perceptual organization and the representation of natural form. In Readings in Computer Vision, pp. 680–699. Cited by: §1.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §3.1, §3.1, §4.
  • Z. Ren and E. B. Sudderth (2016) Three-dimensional object detection and layout prediction using clouds of oriented gradients. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • D. J. Rezende, S. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess (2016) Unsupervised learning of 3d structure from images. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §3.4.
  • G. Schindler and F. Dellaert (2004) Atlanta world: an expectation maximization framework for simultaneous low-level edge grouping and camera calibration in complex man-made environments. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun (2012) Efficient structured prediction for 3d indoor scene understanding. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • S. Song, S. P. Lichtenberg, and J. Xiao (2015) Sun rgb-d: a rgb-d scene understanding benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §5, §5.
  • J. Straub, N. Bhandari, J. J. Leonard, and J. W. Fisher (2015) Real-time manhattan world rotation estimation in 3d. In International Conference on Intelligent Robots and Systems (IROS), Cited by: §2.
  • J. Straub, G. Rosman, O. Freifeld, J. J. Leonard, and J. W. Fisher (2014) A mixture of manhattan frames: beyond the manhattan world. In International Conference on Computer Vision (ICCV), Cited by: §2.
  • S. Suwajanakorn, N. Snavely, J. J. Tompson, and M. Norouzi (2018) Discovery of latent 3d keypoints via end-to-end geometric reasoning. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §3.2.1.
  • B. Tekin, S. N. Sinha, and P. Fua (2018) Real-time seamless single shot 6d object pose prediction. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • S. Tulsiani, S. Gupta, D. Fouhey, A. A. Efros, and J. Malik (2018) Factoring shape, pose, and layout from the 2d image of a 3d scene. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • J. Wagemans, J. H. Elder, M. Kubovy, S. E. Palmer, M. A. Peterson, M. Singh, and R. von der Heydt (2012a) A century of gestalt psychology in visual perception: i. perceptual grouping and figure–ground organization.. Psychological bulletin 138 (6), pp. 1172. Cited by: §1.
  • J. Wagemans, J. Feldman, S. Gepshtein, R. Kimchi, J. R. Pomerantz, P. A. Van der Helm, and C. Van Leeuwen (2012b) A century of gestalt psychology in visual perception: ii. conceptual and theoretical foundations.. Psychological bulletin 138 (6), pp. 1218. Cited by: §1.
  • D. Waltz (1975) Understanding line drawings of scenes with shadows. In The psychology of computer vision, Cited by: §1.
  • J. Y. Wang and E. H. Adelson (1993) Layered representation for motion analysis. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • J. Y. Wang and E. H. Adelson (1994) Representing moving images with layers. Transactions on Image Processing (TIP) 3 (5), pp. 625–638. Cited by: §1.
  • M. Wertheimer (1912) Experimentelle studien uber das sehen von bewegung [experimental studies on the seeing of motion]. Zeitschrift fur Psychologie 61, pp. 161–265. Cited by: §1.
  • M. Wertheimer (1923) Untersuchungen zur lehre von der gestalt, ii. [investigations in gestalt theory: ii. laws of organization in perceptual forms]. Psychologische Forschung 4, pp. 301–350. Cited by: §1.
  • M. Wertheimer (1938) Laws of organization in perceptual forms. In A source book of Gestalt psychology, pp. 71–94. Cited by: §1.
  • J. Wu, Y. Wang, T. Xue, X. Sun, B. Freeman, and J. Tenenbaum (2017) Marrnet: 3d shape reconstruction via 2.5 d sketches. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2, §3.4.
  • J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman (2016) Single image 3d interpreter network. In European Conference on Computer Vision (ECCV), Cited by: §2, §3.4.
  • Y. N. Wu, Z. Si, H. Gong, and S. Zhu (2010) Learning active basis model for object detection and recognition. International Journal of Computer Vision (IJCV) 90 (2), pp. 198–235. Cited by: §3.2.1.
  • J. Xiao and Y. Furukawa (2014) Reconstructing the world’s museums. International Journal of Computer Vision (IJCV). Cited by: §1, §2.
  • J. Xiao, J. Hays, B. C. Russell, G. Patterson, K. Ehinger, A. Torralba, and A. Oliva (2013) Basic level scene understanding: categories, attributes and structures. Frontiers in psychology 4, pp. 506. Cited by: §2.
  • B. Xu and Z. Chen (2018) Multi-level fusion based 3d object detection from monocular images. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee (2016) Perspective transformer nets: learning single-view 3d object reconstruction without 3d supervision. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §3.4.
  • S. Yao, T. M. Hsu, J. Zhu, J. Wu, A. Torralba, B. Freeman, and J. Tenenbaum (2018) 3D-aware scene manipulation via inverse graphics. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §2.
  • X. Zhang, Z. Zhang, C. Zhang, J. B. Tenenbaum, W. T. Freeman, and J. Wu (2018) Learning to reconstruct shapes from unseen classes. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • Y. Zhang, S. Song, P. Tan, and J. Xiao (2014) Panocontext: a whole-room 3d context model for panoramic scene understanding. In European Conference on Computer Vision (ECCV), Cited by: §2.
  • Y. Zhao and S. Zhu (2011) Image parsing with stochastic scene grammar. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • Y. Zhao and S. Zhu (2013) Scene parsing by integrating function, geometry and appearance models. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • J. Zhu, Z. Zhang, C. Zhang, J. Wu, A. Torralba, J. B. Tenenbaum, and W. T. Freeman (2018) Visual object networks: image generation with disentangled 3d representations. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • S. C. Zhu, Y. Wu, and D. Mumford (1998) Filters, random fields and maximum entropy (frame): towards a unified theory for texture modeling. International Journal of Computer Vision (IJCV) 27 (2), pp. 107–126. Cited by: §1.
  • S. Zhu, C. Guo, Y. Wang, and Z. Xu (2005) What are textons?. International Journal of Computer Vision (IJCV) 62 (1-2), pp. 121–143. Cited by: §1.
  • C. Zou, A. Colburn, Q. Shan, and D. Hoiem (2018) LayoutNet: reconstructing the 3d room layout from a single rgb image. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §3.2.1.