Localizing and reconstructing 3D objects from RGB video is a fundamental problem in computer vision. Traditional geometry-based multi-view reconstruction [schoenberger2016sfm, schoenberger2016mvs] can deal with large scenes given rich textures and large baselines, but it is prone to failure in texture-less regions or when the photo-consistency assumption does not hold. Moreover, these methods provide only geometric information and no semantics. An even more challenging question is how to fill in unobserved regions of the scene. Recently, learning-based 3D reconstruction methods [3D-R2D2, PSGN, 3D-VAE-GAN, OccupancyNetworks, gkioxari2019mesh, chabra2019stereodrnet] have emerged and achieved promising results. However, data-driven approaches rely heavily on synthetic renderings and do not generalize well to natural images. On the other hand, we have seen impressive progress in 2D recognition tasks such as detection and segmentation [he2017mask, long2015fully, kirillov2019panoptic].
In this paper, we propose a system for object-centric reconstruction that leverages the best properties of 2D recognition, learning-based object reconstruction and multi-view optimization with deep shape priors. As illustrated in Fig. 2, FroDO takes a sequence of localized RGB images as input and progressively outputs 2D and 3D bounding boxes, 7-DoF pose, a sparse point cloud and a dense mesh for 3D objects in a coarse-to-fine manner. FroDO demonstrates deep prior-based 3D reconstruction of multi-class and multi-object scenes from real-world RGB video. Related approaches are limited to a single view [AtlasNet, 3D-VAE-GAN, PSGN, OccupancyNetworks, grabner20183d], handle multiple views but only single objects [2019-lin], are purely geometry-based [schoenberger2016sfm, schoenberger2016mvs], or require depth and object scans [2019-hou-sis, 2019-hou-sic].
Choosing the best shape representation remains a key open problem in 3D reconstruction. Signed Distance Functions (SDF) have emerged as a powerful representation for learning-based reconstruction [2019-park, OccupancyNetworks] but are not as compact or efficient as point clouds. One of our key contributions is a new joint embedding where shape codes can be decoded to both a sparse point cloud and a dense SDF. Our joint shape embedding enables seamless switching between both representations and can be used as a shape prior for shape optimization, enabling faster inference.
As Fig. 2 illustrates, FroDO takes a calibrated and localized image sequence as input and proceeds in four distinct steps: 2D detection, data association, single-view shape code inference and multi-view shape code optimization. First, per-frame 2D bounding box detections are inferred using an off-the-shelf method [he2017mask]. Second, bounding boxes are associated over multiple frames and lifted into 3D. Next, a latent shape code is predicted for each detection of the same object instance, using a novel encoder network. Per-image shape codes of the same instance are fused into a single code. Finally, shape code and pose are further refined by minimizing terms based on geometric, photometric and silhouette cues, using our joint embedding as a shape prior. The final outputs of our system are dense object meshes placed in the correct position and orientation in the scene.
The contributions of our paper are as follows: (i) FroDO takes as input RGB sequences of real-world multi-object scenes and infers an object-based map, leveraging 2D recognition, learning-based 3D reconstruction and multi-view optimization with shape priors. (ii) We introduce a novel deep joint shape embedding that allows simultaneous decoding to sparse point cloud and continuous SDF representations, and enables faster shape optimization. (iii) We introduce a new coarse-to-fine multi-view optimization approach that combines photometric and silhouette consistency costs with our deep shape prior. (iv) FroDO outperforms state-of-the-art 3D reconstruction methods on real-world datasets: Pix3D [pix3d] for single-object single-view and Redwood-OS [2016choiredwood] for single-object multi-view. We demonstrate multi-class and multi-object reconstruction on challenging sequences from the ScanNet dataset [dai2017scannet].
2 Related Work
At its core, our proposed system infers dense object shape reconstructions from RGB frames, so it relates to multiple areas in 3D scene reconstruction and understanding.
Single-view learning-based shape prediction
In recent years, 3D object shape and pose estimation from images has moved from being purely geometric towards learning techniques, which typically depend on synthetic renderings of ShapeNet [chang2015shapenet] or realistic 2D-3D datasets like Pix3D [pix3d]. These approaches can be categorized based on the shape representation utilized, for example occupancy grids [3D-R2D2, 3D-VAE-GAN], point clouds [PSGN], meshes [wang2018pixel2mesh], or implicit functions [OccupancyNetworks]. Gkioxari et al. [gkioxari2019mesh] jointly train detection and reconstruction by augmenting Mask RCNN with an extra head that outputs volume and mesh.
Our coarse-to-fine reconstruction pipeline includes a single-image encoder-decoder network that predicts a latent shape code, point cloud, and SDF for each detected instance. Our single-view reconstruction network leverages a novel joint embedding that simultaneously outputs point cloud and SDF (Fig. 3). Our quantitative evaluation shows that our approach provides better single-view reconstruction than competing methods.
Multi-view category-specific shape estimation
Structure-from-Motion (SfM) and simultaneous localization and mapping (SLAM) are useful to reconstruct 3D structure from image collections or videos. However, traditional methods are prone to failure when there is a large gap between viewpoints, generally have difficulty filling featureless areas, and cannot reconstruct occluded surfaces. Deep learning approaches like 3D-R2N2 [3D-R2D2], LSM [kar2017learning], and Pix2Vox [xie2019pix2vox] have been proposed for 3D object shape reconstruction. These can infer object shape from either single or multiple observations using RNN or voxel-based fusion. However, these fusion techniques are slow, and data association is assumed to be given.
3D reconstruction with shape priors
These methods are the most closely related to our approach, since they also use RGB video as input and optimize object shape and pose using 3D or image-based reprojection losses, such as photometric and/or silhouette terms, while assuming learnt compact latent shape spaces that are often category-specific. Some examples of the low-dimensional latent spaces used are PCA [2019-wang, li2018optimizable], GPLVM [2012-prisacariu, prisacariu2011shared, dame2013dense] or a learnt neural network [2019-lin]. In a similar spirit, we optimize a shape code for each object, using both 2D and 3D alignment losses, but we propose a new shape embedding that jointly encodes point cloud and DeepSDF representations and show that our coarse-to-fine optimization leads to more accurate results. These optimizable codes have also been used to infer the overall shape of entire scenes [2018-bloesch, sitzmann2019scene] without lifting the representation to the level of objects. Concurrent work [liu2019dist] proposes to optimize DeepSDF embeddings via sphere tracing, closely related to FroDO’s dense optimization stage. We chose to formulate the energy via a proxy mesh, which scales better when many views are used.
Object-aware SLAM
Although our system is not sequential or real-time, it shares common ground with recent object-oriented SLAM methods. Visual SLAM has recently evolved from purely geometric mapping (point, surface or volumetric based) to object-level representations which encode the scene as a collection of reconstructed object instances. SLAM++ [salas2013slam] demonstrated one of the first RGB-D object-based mapping systems where a set of previously known object instances were detected and mapped using an object pose graph. Other instance-based object-aware SLAM systems have either aligned objects from a pre-trained database to volumetric maps [Tateno2016When2I] or models learnt during an exploration step to a surfel representation [stuckler2012model]. In contrast, others have focused on online object discovery and modeling [Choudhary2014SLAMWO] to deal with unknown object instances, dropping the need for known models and pre-trained detectors. Recent RGB-D object-aware SLAM methods leverage the power of state-of-the-art 2D instance semantic segmentation masks [he2017mask] to obtain object-level scene graphs and per-object reconstructions [2018-mccormac] even in the case of dynamic scenes [runz2018maskfusion, xu2019mid]. Object-oriented SLAM has also been extended to the case of monocular RGB-only [pillai2015monocular, nicholson2018quadricslam, parkhiya2018constructing, hosseinzadeh2019real, galvez2015realtimeMO] or visual-inertial inputs [fei2018visual]. Pillai and Leonard [pillai2015monocular] aggregate multi-view detections to perform SLAM-aware object recognition and semi-dense reconstruction, while [nicholson2018quadricslam] fit per-object 3D quadric surfaces. CubeSLAM [Yang2019CubeSLAMM3] proposes a multi-step object reconstruction pipeline where initial cuboid proposals, generated from single-view detections, are further refined through multi-view bundle adjustment.
3 Method Overview
FroDO infers an object-based map of a scene, in a coarse-to-fine manner, given a localized set of RGB images. We assume camera poses and a sparse point cloud have been estimated using standard SLAM or SfM methods such as ORB-SLAM [mur2015orb] or COLMAP [schoenberger2016sfm, schoenberger2016mvs]. We represent the object-based map as a set of object poses with associated 3D bounding boxes and shape codes . denotes a transformation from coordinate system to . Our new joint shape embedding is described in Sec. 4.
The steps in our pipeline are illustrated in Fig. 2. First (Sec. 5.1), objects are detected in the input images using any off-the-shelf detector [he2017mask], correspondences are established between detections of the same object instance in different images, and 2D bounding boxes are lifted into 3D, which enables occlusion reasoning for view selection. Second, a latent shape code is predicted for each visible cropped detection of the same object, using a novel convolutional neural network (Sec. 5.2). Codes are later fused into a unique object shape code (Sec. 5.2). Finally, object poses and shape code are incrementally refined by minimizing energy terms based on geometric and multi-view photometric consistency cues, using our joint shape embedding as a prior (Sec. 5.3).
4 Joint Shape Code Embedding
We propose a new joint latent shape-code space to represent and instantiate complete object shapes in a compact way. This novel embedding is also used as a shape prior to efficiently optimize object shapes from multi-view observations. We parametrize object shapes with a latent code, which can be jointly decoded by two generative models into an explicit sparse 3D point cloud and an implicit signed distance function. While the point cloud decoder generates samples on the object surface, the SDF decoder represents the surface densely via its zero-level set. The decoders are trained simultaneously using a supervised reconstruction loss against ground-truth shapes on both representations (Eq. 1). The point cloud term of this loss evaluates a symmetric Chamfer distance, and the SDF term is a loss between predicted and ground-truth signed distance values that is clipped at a threshold.
We use 3D models from the CAD model repository ShapeNet [chang2015shapenet] as ground truth shapes. While the original DeepSDF architecture [2019-park] is employed for the SDF decoder, a variant of PSGN [PSGN] is used as the point cloud decoder. Its architecture is described in detail in the supplementary material. Joint embeddings decoded to both representations are illustrated in Fig. 4. The trained decoders allow us to leverage learnt object shape distributions, and act as effective priors for optimization-based 3D reconstruction. In contrast to related prior-based shape optimization approaches [2019-lin, liu2019dist] where the shape embedding is specialized to a specific representation, our embedding offers the advantages of both sparse and dense representations at different stages of the optimization. Although DeepSDF can represent smooth and dense object surfaces, it is slow to evaluate as each point needs a full forward pass through the decoder. In contrast, the point cloud representation is two orders of magnitude faster but fails to capture shape details. Our strategy is therefore to infer an initial shape using the point-based decoder before switching to the DeepSDF decoder for further refinement (Sec. 5.3). While inspired by [2019-muralikrishnan] to use multiple shape representations, our embedding offers two advantages. First, the same latent code is used by both decoders, which avoids the need for a latent code consistency loss [2019-muralikrishnan]. Second, training a shape encoder for each representation is not required.
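The joint training loss described above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the squared-distance form of the Chamfer term, the choice of an L1 difference for the clipped SDF term, the squared-norm code regularizer, and all weights and the threshold value are assumptions made for illustration.

```python
def chamfer(a, b):
    """Symmetric Chamfer distance between two non-empty lists of 3D points."""
    def one_sided(src, dst):
        # Mean squared distance from each source point to its nearest target point.
        return sum(min(sum((p - q) ** 2 for p, q in zip(s, d)) for d in dst)
                   for s in src) / len(src)
    return one_sided(a, b) + one_sided(b, a)


def clipped_l1(pred_sdf, gt_sdf, delta):
    """Mean absolute difference after clamping both values to [-delta, delta]."""
    def clamp(v):
        return max(-delta, min(delta, v))
    return sum(abs(clamp(p) - clamp(g)) for p, g in zip(pred_sdf, gt_sdf)) / len(gt_sdf)


def joint_loss(pred_points, gt_points, pred_sdf, gt_sdf, code,
               w_points=1.0, w_sdf=1.0, w_reg=1e-4, delta=0.1):
    """Weighted sum of both reconstruction terms plus a code regularizer.

    All weights and delta are hypothetical defaults, not values from the paper.
    """
    reg = sum(c * c for c in code)  # assumed squared-norm regularizer
    return (w_points * chamfer(pred_points, gt_points)
            + w_sdf * clipped_l1(pred_sdf, gt_sdf, delta)
            + w_reg * reg)
```

Because both terms are computed from the same `code`, no additional consistency loss between two separate latent codes appears in the sum.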
5 From Detections to 3D Objects
5.1 Object Detection and Data Association
We use a standard instance segmentation network [he2017mask] to detect object bounding boxes and object masks in the input RGB video.
To enable multi-view fusion and data aggregation for object shape inference, we predict correspondences between multiple detections of the same 3D object instance. Since the 3D ray through the center of a 2D bounding box points in the direction of the object center, the set of rays from all corresponding detections should approximately intersect. Knowledge of the object class sets reasonable bounds on the object scale to further restrict the expected object center location in 3D to a line segment as indicated by the thicker line segments in Fig. 5.
Object instance data association can then be cast as a clustering problem in which the goal is to identify an unknown number of line segment sets that approximately intersect in a single point. We adopt an efficient iterative non-parametric clustering approach similar to DP-means [kulis2011revisiting] where the observations are line segments and the cluster centers are 3D points. Further details of the clustering algorithm are given in the supplementary material.
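The clustering idea can be sketched as follows. This is only a DP-means-style illustration and not the algorithm from the supplementary material: the function names, the `max_dist` threshold, the midpoint initialization of new clusters and the centre update (mean of the closest points on the member segments) are all assumptions.

```python
def closest_on_segment(p, a, b):
    """Closest point to p on the 3D segment from a to b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    denom = sum(c * c for c in ab)
    t = 0.0 if denom == 0 else sum((pi - ai) * c for pi, ai, c in zip(p, a, ab)) / denom
    t = max(0.0, min(1.0, t))  # stay within the segment
    return tuple(ai + t * c for ai, c in zip(a, ab))


def dist(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) ** 0.5


def cluster_segments(segments, max_dist, iters=10):
    """Assign each segment (a, b) to a 3D centre, creating centres on demand."""
    centres, labels = [], [None] * len(segments)
    for _ in range(iters):
        for i, (a, b) in enumerate(segments):
            d = [dist(c, closest_on_segment(c, a, b)) for c in centres]
            if not d or min(d) > max_dist:
                # No existing centre explains this segment: open a new cluster.
                centres.append(tuple((ai + bi) / 2 for ai, bi in zip(a, b)))
                labels[i] = len(centres) - 1
            else:
                labels[i] = d.index(min(d))
        # Move each centre to the mean of its closest points on member segments.
        for k, c in enumerate(centres):
            members = [closest_on_segment(c, *segments[i])
                       for i, lab in enumerate(labels) if lab == k]
            if members:
                centres[k] = tuple(sum(m[j] for m in members) / len(members)
                                   for j in range(3))
    return centres, labels
```

The number of clusters is not fixed in advance; it grows whenever a segment is farther than `max_dist` from every current centre, which mirrors the non-parametric behaviour the text attributes to DP-means.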
After clustering, each object instance is associated with a set of 2D image detections and a 3D bounding box, computed from the associated bounding box detections as described in [nicholson2018quadricslam]. By comparing the projection of the 3D object bounding box with the 2D detection box, we reject detections that have low IoU, which indicates occlusion or truncation. The filtered set of image detections is used in all following steps. Examples of the filtered detections are shown in the supplementary material.
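The IoU-based rejection of occluded or truncated detections might look like the sketch below. The box layout `(x_min, y_min, x_max, y_max)`, the function names and the `min_iou` value are assumptions; the text does not state the threshold that is actually used.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0


def filter_detections(detections, projected_boxes, min_iou=0.5):
    """Keep detections that agree with the projected 3D box; min_iou is hypothetical."""
    return [d for d, p in zip(detections, projected_boxes) if iou(d, p) >= min_iou]
```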
5.2 Single-view Shape Code Inference and Fusion
As illustrated in the shape encoding section of Fig. 2, an object shape code is predicted for each filtered detection. We train a new encoder network that takes as input a single image crop and regresses its associated shape code in the joint latent space described in Sec. 4.
The network is trained in a fully supervised way. However, due to the lack of 3D shape annotations for real-world image datasets, we train the image encoder using synthetic ShapeNet [chang2015shapenet] renderings. Specifically, we generate training data by rendering ShapeNet CAD models with random viewpoints, materials, environment mapping, and background. We also perturb bounding boxes of rendered objects and feed perturbed crops to the encoder during training. We chose a standard ResNet architecture, modifying its output to the size of the embedding vector. During training, we minimize the Huber loss between predicted and target embeddings, which we know for all CAD models. For the experiment on ScanNet in Sec. 6.3, we fine-tune the encoder network with supervised data from Pix3D [pix3d].
Multi-view Shape Code Fusion
For each object instance we fuse all single-view shape codes into a unique code. We propose two fusion approaches and evaluate them in Table 4: (i) Average: we average shape codes to form a mean code; (ii) Majority voting: we find the 4 nearest neighbors of each predicted code among the models in the training set, and the most frequent of these is chosen as the fused code. Unlike the average code, the majority-voted code guarantees valid shapes from the object database.
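Both fusion strategies are simple enough to sketch directly. The function names and the use of squared Euclidean distance for the nearest-neighbour search are assumptions; the default `k=4` follows the number of neighbours stated in the text.

```python
from collections import Counter


def average_code(codes):
    """Element-wise mean of the per-view shape codes."""
    n = len(codes)
    return [sum(c[j] for c in codes) / n for j in range(len(codes[0]))]


def vote_code(codes, train_codes, k=4):
    """Each per-view code votes for its k nearest training codes; return the most voted."""
    votes = Counter()
    for c in codes:
        order = sorted(range(len(train_codes)),
                       key=lambda i: sum((a - b) ** 2 for a, b in zip(c, train_codes[i])))
        votes.update(order[:k])
    winner, _ = votes.most_common(1)[0]
    return train_codes[winner]
```

The voted result is always an entry of `train_codes`, which is why it corresponds to a valid database shape, whereas the mean code can fall between training shapes.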
5.3 Multi-view Optimization with Shape Priors
For each object instance, all images with non-occluded detections are used as input to an energy optimization approach that estimates object pose and shape code in two steps. First, we optimize the energy over a sparse set of surface points, using the point decoder as a shape prior. This step is fast and efficient due to the sparse nature of the representation as well as the light weight of the point cloud decoder. Second, we further refine pose and shape by minimizing the same energy over dense surface points, now using the DeepSDF decoder as the prior. This slower process is more accurate since the loss is evaluated over all surface points rather than sparse samples.
Energy. Our energy is a combination of losses on the 2D silhouette, photometric consistency and geometry, together with a shape code regularizer (Eq. 4). Scalar weights balance the contributions of the individual terms. The regularization term encourages shape codes to take values in valid regions of the embedding, analogously to the regularizer in Eq. 1. Note that the same energy terms are used for sparse and dense optimization; the main differences are the number of points over which the loss is evaluated and the decoder used as shape prior.
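Based on the description above, the energy can be written as a weighted sum of the four terms. The symbols below are notation assumed here for illustration: $E_s$, $E_p$, $E_g$ and $E_r$ stand for the silhouette, photometric, geometric and regularization terms, and the $\lambda$ values for their weights.

```latex
E \;=\; \lambda_s E_s \;+\; \lambda_p E_p \;+\; \lambda_g E_g \;+\; \lambda_r E_r
```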
Initialization. The shape code is initialized to the fused shape code (Sec. 5.2), while the pose is initialized from the 3D bounding box (Sec. 5.1): translation is set to the vector joining the origin of the world coordinate frame with the 3D bounding box centroid, scale is set to the 3D bounding box height, and rotation is initialized using an exhaustive search for the best rotation about the gravity direction, under the assumption that objects are supported by a ground plane perpendicular to gravity.
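The exhaustive rotation search reduces to a one-dimensional grid search, as in the following sketch. The function names, the number of grid steps and the convention that gravity is aligned with the z axis are assumptions; `cost` stands for whichever alignment energy is evaluated at each candidate angle.

```python
import math


def rotate_about_z(point, angle):
    """Rotate a 3D point about the z axis, taken here as the gravity direction."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = point
    return (c * x - s * y, s * x + c * y, z)


def best_yaw(cost, steps=36):
    """Evaluate cost(angle) on a uniform grid over [0, 2*pi) and return the best angle."""
    angles = [2.0 * math.pi * i / steps for i in range(steps)]
    return min(angles, key=cost)
```

For example, `best_yaw(lambda a: my_energy(a))` with a hypothetical `my_energy` returns the grid angle with the lowest energy, which can then seed the continuous optimization.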
Sparse Optimization. Throughout the sparse optimization, the energy is defined over the sparse set of surface points decoded with the point-based decoder. The energy is minimized using the Adam optimizer [kingma2014adam] with automatic differentiation. We now define the energy terms.
The photometric loss encourages the colour of 3D points to be consistent across views. In the sparse case, we evaluate it by projecting the decoded points to nearby frames via known camera poses and comparing colours in reference and source images under a Huber norm (Eq. 5). Each 3D point is mapped into an image by the projection function of the corresponding camera.
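A minimal sketch of such a sparse photometric term is given below. It assumes a pinhole camera with a single focal length, grayscale images stored as nested lists, nearest-pixel sampling, and points that lie in front of both cameras; all names and the Huber threshold are hypothetical, and a real implementation would use differentiable sampling.

```python
def huber(r, delta=1.0):
    """Huber norm: quadratic for small residuals, linear for large ones."""
    a = abs(r)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def project(point, pose, focal, centre):
    """Pinhole projection; pose = (R, t) maps world coordinates to the camera frame."""
    R, t = pose
    x, y, z = (sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))
    return (focal * x / z + centre[0], focal * y / z + centre[1])


def sample(image, uv):
    """Nearest-pixel lookup in a row-major grayscale image; None if outside."""
    col, row = int(round(uv[0])), int(round(uv[1]))
    if 0 <= row < len(image) and 0 <= col < len(image[0]):
        return image[row][col]
    return None


def photometric_loss(points, ref, src, focal, centre, delta=1.0):
    """ref and src are (image, pose) pairs; averages residuals over points seen in both."""
    total, count = 0.0, 0
    for p in points:
        a = sample(ref[0], project(p, ref[1], focal, centre))
        b = sample(src[0], project(p, src[1], focal, centre))
        if a is not None and b is not None:
            total += huber(a - b, delta)
            count += 1
    return total / count if count else 0.0
```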
The silhouette loss penalizes discrepancies between the 2D silhouette obtained via projection of the current 3D object shape estimate and the mask predicted with MaskRCNN [he2017mask]. In practice, we penalize points that project outside the predicted mask using the 2D Chamfer distance between the projected points and a set of 2D samples on the predicted mask, where the distance is the symmetric Chamfer distance defined in Eq. 3.
The geometric loss minimizes the 3D Chamfer distance between 3D SLAM (or SfM) points and points on the current object shape estimate (Eq. 7).
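One way to write the sparse silhouette and geometric terms consistently with the descriptions above is shown below. The notation is assumed for illustration only: $X$ denotes the decoded object points, $\pi(X)$ their 2D projections, $M$ the 2D samples on the predicted mask, $P$ the SLAM (or SfM) points, and $D_C$ a symmetric Chamfer distance, written here with squared distances as one common choice.

```latex
\begin{aligned}
D_C(A, B) &= \frac{1}{|A|} \sum_{a \in A} \min_{b \in B} \lVert a - b \rVert^2
           + \frac{1}{|B|} \sum_{b \in B} \min_{a \in A} \lVert a - b \rVert^2, \\
E_s &= D_C\big(M, \pi(X)\big), \qquad
E_g = D_C\big(X, P\big).
\end{aligned}
```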
Dense Optimization. The shape code and pose estimated with the sparse optimization can be further refined with a dense optimization over all surface points, using the DeepSDF decoder. Since this decoder uses an implicit representation of the object surface, we compute a proxy mesh at each iteration and formulate the energy over its vertices. This strategy proved faster than sphere tracing [liu2019dist] while achieving on-par accuracy (see Table 3). Relevant Jacobians are derived analytically and are given in the supplementary material together with further implementation details. We now describe the dense energy terms.
The photometric and geometric losses are equivalent to those used in the sparse optimization (see Eq. 5, 7). However, they are evaluated densely and the photometric optimization makes use of a Lucas-Kanade style warp.
The silhouette loss takes a different form to the sparse case. We follow traditional level set approaches, comparing the projections of object estimates with observed foreground and background probabilities. This comparison involves a 3D or 2D shape-kernel and a mapping to a 2D foreground probability field, resembling an object mask of the current state. Empirically, we found that 3D shape-kernels [2012-prisacariu-pwp3d] provide higher quality reconstructions when compared with a 2D formulation [2012-prisacariu] because more regions contribute to gradients. While this mapping is a Heaviside function in the presence of 2D level-sets, we interpret signed distance samples of the DeepSDF volume as logits and compute a per-pixel foreground probability by accumulating samples along rays, similar to Prisacariu et al. [2012-prisacariu]. The accumulation uses a smoothing coefficient and the background probability at each sampling location along the ray. The step size is chosen relative to the depth range of the object-space unit-cube.
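The accumulation of signed distance samples along a ray can be sketched as follows. The exact formula is an assumption: each sample is turned into a background probability with a sigmoid scaled by the smoothing coefficient, and the pixel is foreground unless every sample along the ray is background. The function name and argument names are hypothetical.

```python
import math


def foreground_probability(sdf_samples, smoothing):
    """One minus the product of per-sample background probabilities along one ray."""
    background = 1.0
    for d in sdf_samples:
        # sigmoid(smoothing * d): near 1 outside the surface (d > 0), near 0 inside.
        background *= 1.0 / (1.0 + math.exp(-smoothing * d))
    return 1.0 - background
```

A ray that never enters the object yields a probability close to 0, while a single sample well inside the surface is enough to push it close to 1; larger `smoothing` values make this transition sharper.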
6 Experimental Evaluation
Our focus is to evaluate the performance of FroDO on real-world datasets wherever possible. We evaluate quantitatively in two scenarios: (i) single-view, single object on Pix3D [pix3d]; and (ii) multi-view, single object on the Redwood-OS [2016choiredwood] dataset. In addition, we evaluate our full approach qualitatively on challenging sequences from the real-world ScanNet dataset [dai2017scannet] that contain multiple object instances. In all cases we use MaskRCNN [he2017mask] to predict object detections and masks. We run ORB-SLAM [mur2015orb] to estimate trajectories and keypoints on Redwood-OS but use the provided camera poses and no keypoints on ScanNet.
6.1 Single-View Object Reconstruction
|Method||IoU||EMD||CD|
|Sun et al. [pix3d]||0.282||0.118||0.119|
|Ours (DeepSDF Embedding)||0.302||0.112||0.103|
|Ours (Joint Embedding)||0.325||0.104||0.099|
First we evaluate the performance of our single-view shape code prediction network (Sec. 5.2) on the real world dataset Pix3D [pix3d]. Table 1 shows a comparison with competing approaches on the chair category. The evaluation protocol described in [pix3d] was used to compare IoU, Earth Mover Distance (EMD) and Chamfer Distance (CD) errors (results of competing methods are from [pix3d]). Our proposed encoder network outperforms related work in all metrics. Table 1 also shows an improvement in performance when our new joint shape embedding is used (Ours Joint Embedding) instead of DeepSDF [2019-park] (Ours DeepSDF Embedding). Figure 6 shows example reconstructions.
6.2 Multi-View Single Object Reconstruction
We quantitatively evaluate our complete multi-view pipeline on the chair category of the real-world Redwood-OS dataset [2016choiredwood] which contains single object scenes. We perform two experiments: an ablation study to motivate the choice of terms in the energy function (Table 2) and a comparison of the performance of the different steps of our pipeline with related methods (Table 4). Table 3 includes a comparison of our dense photometric optimization with the two closest related approaches [2019-lin, liu2019dist] on a commonly-used synthetic dataset [2019-lin].
|Optim. Method||Energy Terms||CD (cm.)|
|Sparse + Dense|
|Sparse + Dense|
Ablation study. Table 2 shows an ablation study on different energy terms in our sparse and dense optimizations (Eq. 4). The combination of geometric and photometric cues with a regularizer on the latent space achieves best results. The supplementary material includes further experiments on the effect of filtering object detections (Sec. 5.1) and the efficiency gains of using our joint embedding.
|PMO (o)||PMO (r)||DIST (r)||Ours (r)|
Synthetic dataset. Table 3 shows a direct comparison of the performance on the synthetic PMO test set [2019-lin] of our dense optimization when only the photometric loss is used in our energy, with the two closest related methods: PMO [2019-lin] and DIST [liu2019dist]. Notably, both DIST and our approach achieve comparable results to PMO from only random initializations. When PMO is also initialized randomly the results degrade substantially.
Redwood-OS dataset. Table 4 shows a comparison with Pix2Vox [xie2019pix2vox], a purely deep learning approach, and with PMO [2019-lin], both of which are state-of-the-art. For reference, we also compare with COLMAP [schoenberger2016sfm, schoenberger2016mvs], a traditional SfM approach. Since COLMAP reconstructs the full scene without segmenting objects, we only select points within the ground-truth 3D bounding box for evaluation. We report errors using Chamfer distance (CD), accuracy (ACC (5cm)), completion (COMP (5cm)) and F1 score, all four commonly used when evaluating on Redwood-OS. Chamfer distance measures the symmetric error, while shape accuracy captures the 3D error as the distance from predicted points to their closest point in the ground truth shape, and shape completion measures the same distance in the opposite direction. Both shape accuracy and completion are measured as the percentage of points with an error below 5 cm. Following [2019-lin], we use an average of 35 input frames sampled from the RGB sequences, though for completeness we also show results with 350 views. Fig. 7 shows example reconstructions.
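The accuracy, completion and F1 metrics can be computed as in the sketch below. The function names are hypothetical, the brute-force nearest-neighbour search is only suitable for small point sets, and taking F1 as the harmonic mean of accuracy and completion is an assumption, although it agrees with the values reported in Table 4.

```python
def nearest_distances(src, dst):
    """Distance from every point in src to its closest point in dst."""
    return [min(sum((a - b) ** 2 for a, b in zip(s, d)) for d in dst) ** 0.5
            for s in src]


def evaluate(pred, gt, threshold):
    """Accuracy and completion in percent, plus their harmonic mean (F1)."""
    acc = 100.0 * sum(d < threshold for d in nearest_distances(pred, gt)) / len(pred)
    comp = 100.0 * sum(d < threshold for d in nearest_distances(gt, pred)) / len(gt)
    f1 = 2.0 * acc * comp / (acc + comp) if acc + comp > 0 else 0.0
    return acc, comp, f1
```

As a sanity check on the harmonic-mean assumption, the sparse optimization row for few observations lists an accuracy of 70.58 and a completion of 79.10, whose harmonic mean is approximately 74.60, the reported F1 score.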
We outperform Xie et al. [xie2019pix2vox] by a significant margin, which could point to a lack of generalization in purely learning-based approaches. We also outperform PMO [2019-lin], a shape prior-based optimization approach like ours, but one that lacks our proposed coarse-to-fine shape upgrade. COLMAP fails to reconstruct full 3D shapes when the number of input images or the baseline between viewpoints is limited, as it cannot leverage pre-learnt object priors. Although, as expected, the performance of COLMAP increases drastically with the number of input images, it requires hundreds of views to perform comparably to our approach.
6.3 Multi-Object Reconstruction
We demonstrate qualitative results of our full approach on the ScanNet dataset [dai2017scannet] on challenging real-world scenes with multiple object instances in Fig. 1 and Fig. 8. MaskRCNN [he2017mask] was used to predict 2D bounding boxes and masks. The association of object detections to 3D object instances becomes an additional challenge when dealing with multi-object scenarios. Our results show that our ray clustering approach successfully associates detected bounding boxes across frames and that our coarse-to-fine optimization scheme provides high quality object poses and reconstructions.
7 Conclusions and Discussion
We introduced FroDO, a novel object-oriented 3D reconstruction framework that takes localized monocular RGB images as input and infers the location, pose and accurate shape of the objects in the scene. Key to FroDO is the use of a new deep learnt shape encoding throughout the different shape estimation steps. We demonstrated FroDO on challenging sequences from real-world datasets in single-view, multi-view and multi-object settings. An exciting open challenge would be to extend FroDO to the case of dynamic scenes with independently moving objects.
Acknowledgement The Univ. of Adelaide authors’ work has been supported by the Australian Research Council through the Centre of Excellence for Robotic Vision CE140100016 and Laureate Fellowship FL130100102, and UCL authors’ work has been supported by the SecondHands project, funded from the EU Horizon 2020 Research and Innovation programme under GA No 643950.
|Method||Few observations (average 35 views)||Over-complete observations (average 350 views)|
|CD (cm)||ACC (5cm)||COMP (5cm)||F1 score||CD (cm)||ACC (5cm)||COMP (5cm)||F1 score|
|COLMAP [schoenberger2016sfm, schoenberger2016mvs]||10.58||84.16||54.28||65.99||6.05||91.41||94.59||92.97|
|FroDO Code Fusion (Vote)||12.19||60.74||60.55||60.64||11.97||61.37||58.20||59.74|
|FroDO Code Fusion (Aver.)||10.74||61.31||72.11||66.27||10.57||61.06||72.14||66.14|
|FroDO Optim. Sparse||8.69||70.58||79.10||74.60||8.59||71.69||81.63||76.34|
|FroDO Optim. Dense||7.38||73.70||80.85||76.64||7.37||74.78||81.08||77.32|
Figure 7: Example reconstructions on Redwood-OS. Columns: RGB, GT scan, COLMAP, PMO, FroDO (sparse), FroDO (dense).