Deep Single-View 3D Object Reconstruction with Visual Hull Embedding
3D object reconstruction is a fundamental task of many robotics and AI problems. With the aid of deep convolutional neural networks (CNNs), 3D object reconstruction has witnessed significant progress in recent years. However, possibly due to the prohibitively high dimension of the 3D object space, the results from deep CNNs are often prone to missing some shape details. In this paper, we present an approach which aims to preserve more shape details and improve the reconstruction quality. The key idea of our method is to leverage object mask and pose estimation from CNNs to assist the 3D shape learning by constructing a probabilistic single-view visual hull inside of the network. Our method works by first predicting a coarse shape as well as the object pose and silhouette using CNNs, followed by a novel 3D refinement CNN which refines the coarse shapes using the constructed probabilistic visual hulls. Experiments on both synthetic data and real images show that embedding a single-view visual hull for shape refinement can significantly improve the reconstruction quality by recovering more shape details and improving shape consistency with the input image.
Recovering the dense 3D shapes of objects from 2D images is a fundamental AI problem which has many applications such as robot-environment interaction, 3D-based object retrieval and recognition, etc. Given a single image of an object, a human can reason about the 3D structure of the object reliably. However, single-view 3D object reconstruction is very challenging for computer algorithms.
Recently, significant progress in single-view 3D reconstruction has been achieved by using deep convolutional neural networks (CNNs) [Choy et al.2016, Girdhar et al.2016, Wu et al.2016, Yan et al.2016, Fan, Su, and Guibas2017, Tulsiani et al.2017b, Zhu et al.2017, Wu et al.2017, Tulsiani, Efros, and Malik2018]. Most CNN-based methods reconstruct the object shapes using 2D and 3D convolutions in a 2D encoder-3D decoder structure with the volumetric 3D representation. The inputs to these CNNs are object images taken under unknown viewpoints, while the output shapes are often aligned with the canonical viewpoint in a single, pre-defined 3D coordinate system such that shape regression is more tractable.
Although promising results have been shown by these CNN-based methods, single-view 3D reconstruction is still a challenging problem and the results are far from perfect. One of the main difficulties lies in the object shape variations, which can be very large even within the same object category. The appearance variations in the input images caused by pose differences make this task even harder. Consequently, the results from CNN-based methods are prone to missing some shape details and sometimes generate plausible shapes which, however, are inconsistent with the input images, as shown in Figure 1.
In this paper, we propose an approach to improve the fidelity of the shapes reconstructed by CNNs. Our method combines traditional wisdom with the network architecture. It is motivated by two observations: 1) while directly recovering all the shape details in 3D is difficult, extracting the projected shape silhouette on the 2D plane, i.e., segmenting out the object from the background, is a relatively easy task using CNNs; 2) for some common objects such as chairs and cars whose 3D coordinate systems are well defined without ambiguity, the object pose (or equivalently, the viewpoint) can also be well estimated by a CNN [Su et al.2015, Massa, Marlet, and Aubry2016]. As such, we propose to leverage the object silhouettes to assist the 3D learning by lifting them to 3D using pose estimates.
Figure 2 is a schematic description of our method, which is a pure GPU-friendly neural network solution. Specifically, we embed into the network a single-view visual hull using the estimated object silhouettes and poses. Embedding a visual hull can help recover more shape details by considering the projection relationship between the reconstructed 3D shape and the 2D silhouette. Since both the pose and segmentation are subject to estimation error, we opted for a “soft” visual-hull embedding strategy: we first predict a coarse 3D shape using a CNN, then employ another CNN to refine the coarse shape with the constructed visual hull. We propose a probabilistic single-view visual hull (PSVH) construction layer which is differentiable such that the whole network can be trained end-to-end.
In summary, we present a novel CNN-based approach which uses a single-view visual hull to improve the quality of shape predictions. Through our method, the perspective geometry is seamlessly embedded into a deep network. We evaluate our method on synthetic data and real images, and demonstrate that using a single-view visual hull can significantly improve the reconstruction quality by recovering more shape details and improving shape consistency with input images.
Traditional methods. Reconstructing a dense 3D object shape from a single image is an ill-posed problem. Traditional methods resort to geometry priors for the otherwise prohibitively challenging task. For example, some methods leveraged pre-defined CAD models [Sun et al.2013]. Category-specific reconstruction methods [Vicente et al.2014, Kar et al.2015, Tulsiani et al.2017a] reconstruct a 3D shape template from images of the objects in the same category as a shape prior. Given an input image, these methods estimate the silhouette and viewpoint from the input image and then reconstruct the 3D object shape by fitting the shape template to the estimated visual hull. Our method integrates the single-view visual hull with a deep neural network for reconstructing a 3D shape from a single image.
Deep learning for 3D reconstruction. Deep learning based methods directly learn the mapping from 2D image to a dense 3D shape from training data. For example, [Choy et al.2016] directly trained a network with 3D shape loss. [Yan et al.2016] trained a network by minimizing the difference between the silhouette of the predicted 3D shape and ground truth silhouette on multiple views. A ray consistency loss is proposed in [Tulsiani et al.2017b] which uses other types of multi-view observations for training such as depth, color and semantics. [Wu et al.2017] applied CNNs to first predict the 2.5D sketches including normal, depth and silhouette, then reconstruct the 3D shape. A reprojection consistency constraint between the 3D shape and 2.5D sketches is used to finetune the network on real images. [Zhu et al.2017] jointly trained a pose regressor with a 3D reconstruction network so that the object images with annotated masks yet unknown poses can be used for training. Many existing methods have explored using pose and silhouette (or other 2D/2.5D observations) to supervise the 3D shape prediction [Yan et al.2016, Tulsiani et al.2017b, Gwak et al.2017, Zhu et al.2017, Wu et al.2017, Tulsiani, Efros, and Malik2018]. However, our goal is to refine the 3D shape inside of the network using an estimated visual hull, and our visual hull construction is an inverse process of their shape-to-image projection scheme. More discussions can be found in the supplementary material.
Generative models for 3D shape. Some efforts are devoted to modeling the 3D shape space using generative models such as GAN [Goodfellow et al.2014] and VAE [Kingma and Welling2013]. In [Wu et al.2016], a 3D-GAN method is proposed for learning the latent space of 3D shapes and a 3D-VAE-GAN is also presented for mapping image space to shape space. A fully convolutional 3D autoencoder for learning shape representation from noisy data is proposed in [Sharma, Grau, and Fritz2016]. A weakly-supervised GAN for 3D reconstruction with the weak supervision from silhouettes can be found in [Gwak et al.2017].
3D shape representation. Most deep object reconstruction methods use the voxel grid representation [Choy et al.2016, Girdhar et al.2016, Yan et al.2016, Tulsiani et al.2017b, Zhu et al.2017, Wu et al.2017, Wu et al.2016, Gwak et al.2017], i.e., the output is a voxelized occupancy map. Recently, memory-efficient representations such as point clouds [Qi et al.2017], voxel octrees [Häne, Tulsiani, and Malik2017, Tatarchenko, Dosovitskiy, and Brox2017] and shape primitives [Zou et al.2017] have been investigated.
Visual hull for deep multi-view 3D reconstruction. Some recent works use visual hulls of color [Ji et al.2017] or learned feature [Kar, Häne, and Malik2017] for multi-view stereo with CNNs. Our method is different from theirs in several ways. First, the motivations of using visual hulls differ: they use visual hulls as input to their multi-view stereo matching networks in order to reconstruct the object shape, whereas our goal is to leverage a visual hull to refine a coarse single-view shape prediction. Second, the object poses are given in their methods, while in ours the object pose is estimated by a CNN. Related to the above, our novel visual hull construction layer is made differentiable, and object segmentation, pose estimation and 3D reconstruction are jointly trained in one framework.
In this section, we detail our method, which takes as input a single image of a common object such as a car, chair or couch, and predicts its 3D shape. We assume the objects are roughly centered (e.g. those in bounding boxes given by an object detector).
Shape representation. We use a voxel grid for shape representation similar to previous works [Wu et al.2015, Yan et al.2016, Wu et al.2016, Zhu et al.2017, Wu et al.2017], i.e., the output of our network is a voxelized occupancy map in the 3D space. This representation is very suitable for visual hull construction and processing, and it is also possible to extend our method to use tree-structured voxel grids for more fine-grained details [Häne, Tulsiani, and Malik2017, Tatarchenko, Dosovitskiy, and Brox2017].
Camera model. We choose the perspective camera model for the 3D-2D projection geometry, and reconstruct the object in a unit-length cube located in front of the camera (i.e., with cube center near the positive Z-axis in the camera coordinate frame). Under a perspective camera model, the relationship between a 3D point $\mathbf{X} = [X, Y, Z]^T$ and its projected pixel location $\mathbf{x} = [u, v]^T$ on the image is
$$ s\,[u, v, 1]^T = \mathbf{K}\,(\mathbf{R}\,\mathbf{X} + \mathbf{t}) \qquad (1) $$
where $s$ is the depth of the point in the camera frame and $\mathbf{K} = \begin{bmatrix} f & 0 & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{bmatrix}$ is the camera intrinsic matrix with $f$ being the focal length and $(u_0, v_0)$ the principal point. We assume that the principal point coincides with the image center (or is otherwise given), and that focal lengths are known. Note that when the exact focal length is not available, a rough estimate or an approximation may still suffice. When the object is reasonably distant from the camera, one can use a large focal length to strike a balance between perspective and weak-perspective models.
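Pose parametrization. The object pose is characterized by a rotation matrix $\mathbf{R}$ and a translation vector $\mathbf{t}$ in Eq. 1. We parameterize rotation simply with Euler angles $\theta_1, \theta_2, \theta_3$. For translation we estimate its Z component $t_Z$ and a 2D vector $[t_u, t_v]^T$ which centralizes the object on the image plane, and obtain $\mathbf{t}$ by back-projecting this 2D vector to depth $t_Z$. In summary, we parameterize the pose as a 6-D vector $\mathbf{p} = [\theta_1, \theta_2, \theta_3, t_u, t_v, t_Z]^T$.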
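To make the camera model and pose parametrization concrete, the following is a minimal NumPy sketch of the perspective projection. Several details are assumptions made only for illustration and are not specified in the text: the Euler-angle composition order (`Rz @ Ry @ Rx`), the interpretation of the 2D translation vector as a pixel offset from the principal point, and all function names.

```python
import numpy as np

def euler_to_rotation(theta):
    """Rotation matrix from three Euler angles in radians.

    The composition order Rz @ Ry @ Rx is an assumption; the text only
    states that rotation is parameterized with Euler angles.
    """
    ax, ay, az = theta
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    return Rz @ Ry @ Rx

def pose_to_translation(t_uv, t_z, f):
    """Back-project a 2D image-plane offset to depth t_z (assumed form).

    t_uv is taken to be a pixel offset relative to the principal point.
    """
    return np.array([t_uv[0] * t_z / f, t_uv[1] * t_z / f, t_z])

def project(points, R, t, f, principal_point):
    """Perspective projection of (N, 3) object-frame points to (N, 2) pixels."""
    cam = points @ R.T + t                      # points in the camera frame
    u = f * cam[:, 0] / cam[:, 2] + principal_point[0]
    v = f * cam[:, 1] / cam[:, 2] + principal_point[1]
    return np.stack([u, v], axis=1)
```

Under these assumptions, the object origin projects to the principal point shifted by the 2D translation vector, which is one way to read "centralizes the object on the image plane".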
Given a single image as input, we first apply a CNN to directly regress a 3D volumetric reconstruction similar to previous works such as [Choy et al.2016]. We call this network the V-Net. Additionally, we apply another two CNNs for pose estimation and segmentation, referred to as P-Net and S-Net respectively. In the following we describe the main structure of these sub-networks; more details can be found in the supplementary material.
V-Net: The V-Net for coarse shape prediction is adapted from the network of [Choy et al.2016], and the main difference is that we replaced their LSTM layer designed for multi-view reconstruction with a simple fully connected (FC) layer. We denote the 3D voxel occupancy probability map produced by the V-Net as $\mathbf{V}$.
P-Net: The P-Net for pose estimation is a simple regressor outputting a 6-dimensional pose vector denoted as $\mathbf{p}$, as shown in Fig. 3 (b). We construct the P-Net structure simply by appending two FC layers to the encoder structure of V-Net, one with 512 neurons and the other with 6 neurons.
S-Net: The S-Net for object segmentation has a 2D encoder-decoder structure, as shown in Fig. 3 (c). We use the same encoder structure as V-Net for the S-Net encoder, and apply a mirrored decoder structure consisting of deconv and unpooling layers. The S-Net generates an object probability map of 2D pixels, which we denote as $\mathbf{S}$.
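Given the estimated pose $\mathbf{p}$ and the object probability map $\mathbf{S}$ on the image plane, we construct inside of our neural network a Probabilistic Single-view Visual Hull (PSVH) in the 3D voxel grid. To achieve this, we project each voxel location $\mathbf{X}$ onto the image plane by the perspective transformation in Eq. 1 to obtain its corresponding pixel $\mathbf{x}$. Then we assign $\mathbf{H}(\mathbf{X}) = \mathbf{S}(\mathbf{x})$, where $\mathbf{H}$ denotes the generated probabilistic visual hull. This process is illustrated in Fig. 3 (d).
The PSVH layer is differentiable, which means that the gradients backpropagated to $\mathbf{H}$ can be backpropagated to $\mathbf{S}$ and pose $\mathbf{p}$, hence further to S-Net and P-Net. The gradient of $\mathbf{H}$ with respect to $\mathbf{S}$ is easy to discern: we have built correspondences from $\mathbf{H}$ to $\mathbf{S}$ and simply copied the values. Propagating gradients to $\mathbf{p}$ is somewhat tricky. According to the chain rule, we have
$$ \frac{\partial l}{\partial \mathbf{p}} = \sum_{\mathbf{X}} \frac{\partial l}{\partial \mathbf{H}(\mathbf{X})} \, \frac{\partial \mathbf{H}(\mathbf{X})}{\partial \mathbf{X}} \, \frac{\partial \mathbf{X}}{\partial \mathbf{p}} $$
where $l$ is the network loss. Obtaining $\partial l / \partial \mathbf{p}$ necessitates computing $\partial \mathbf{H}(\mathbf{X}) / \partial \mathbf{X}$, i.e., the spatial gradients of $\mathbf{H}$, which can be numerically computed by three convolution operations with pre-defined kernels along the X-, Y- and Z-axes respectively. $\partial \mathbf{X} / \partial \mathbf{p}$ can be derived analytically.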
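The forward pass of the visual hull construction described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the authors' layer: it uses nearest-neighbor pixel lookup, places voxel centers in a unit cube centered at the object origin, indexes the grid as `[x, y, z]`, and does not implement the backward pass to the pose. The function name `build_psvh` and its parameters are hypothetical.

```python
import numpy as np

def build_psvh(silhouette, R, t, f, principal_point, grid_size=32):
    """Copy silhouette probabilities to the voxels that project onto them.

    silhouette: (H, W) array of object probabilities in [0, 1].
    Returns a (grid_size, grid_size, grid_size) array indexed [x, y, z].
    Nearest-neighbor sampling and the voxel layout are assumptions.
    """
    # Voxel centers of a unit-length cube centered at the object origin.
    coords = (np.arange(grid_size) + 0.5) / grid_size - 0.5
    X, Y, Z = np.meshgrid(coords, coords, coords, indexing="ij")
    pts = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)

    # Transform to the camera frame and apply the perspective projection.
    cam = pts @ R.T + t
    depth = cam[:, 2]
    valid = depth > 1e-6                       # keep points in front of the camera
    safe_depth = np.where(valid, depth, 1.0)   # avoid division by zero
    u = f * cam[:, 0] / safe_depth + principal_point[0]
    v = f * cam[:, 1] / safe_depth + principal_point[1]

    # Nearest pixel; voxels projecting outside the image get probability 0.
    col = np.rint(u).astype(int)
    row = np.rint(v).astype(int)
    h, w = silhouette.shape
    inside = valid & (col >= 0) & (col < w) & (row >= 0) & (row < h)

    hull = np.zeros(pts.shape[0], dtype=silhouette.dtype)
    hull[inside] = silhouette[row[inside], col[inside]]
    return hull.reshape(grid_size, grid_size, grid_size)
```

Because each voxel simply copies the value of its corresponding pixel, a soft (probabilistic) silhouette yields a soft hull, which is what allows a later refinement stage to tolerate segmentation and pose errors.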
With a coarse voxel occupancy probability map $\mathbf{V}$ from V-Net and the visual hull $\mathbf{H}$ from the PSVH layer, we use a 3D CNN to refine $\mathbf{V}$ and obtain a final prediction, denoted by $\mathbf{V}^+$. We refer to this refinement CNN as R-Net. The basic structure of our R-Net is shown in Fig. 3 (e). It consists of five 3D conv layers in the encoder and 14 3D conv layers in the decoder.
A straightforward way for R-Net to process $\mathbf{V}$ and $\mathbf{H}$ is to concatenate them into a 2-channel 3D voxel grid as input and then generate a new $\mathbf{V}$ as output. Nevertheless, we have some domain knowledge on this specific problem. For example, if a voxel predicted as occupied falls out of the visual hull, it’s likely to be a false alarm; if the prediction does not have any occupied voxel in a viewing ray of the visual hull, some voxels may have been missed. This domain knowledge prompted us to design the R-Net in the following manner.
First, in addition to $\mathbf{V}$ and $\mathbf{H}$, we feed into R-Net two more occupancy probability maps: $\mathbf{V} \odot (1 - \mathbf{H})$ and $\mathbf{H} \odot (1 - \mathbf{V})$, where $\odot$ denotes element-wise product. These two probability maps characterize voxels in $\mathbf{V}$ but not in $\mathbf{H}$, and voxels in $\mathbf{H}$ but not in $\mathbf{V}$, respectively. (A better alternative for $\mathbf{H} \odot (1 - \mathbf{V})$ would be constructing another visual hull using $\mathbf{V}$ and then computing its difference from $\mathbf{H}$. We choose $\mathbf{H} \odot (1 - \mathbf{V})$ here for simplicity.) Second, we add a residual connection between the input voxel prediction $\mathbf{V}$ and the output of the last layer. This way, we guide R-Net to generate an effective shape deformation to refine $\mathbf{V}$ rather than directly predicting a new $\mathbf{V}$, as the predicted $\mathbf{V}$ from V-Net is often mostly reasonable (as found in our experiments).
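As an illustration of the input design described above, the sketch below assembles a 4-channel R-Net input from a coarse occupancy map and a visual hull, and applies a residual update. The channel-last layout, the function names, and the choice to add the residual directly to probabilities and clip to [0, 1] are assumptions for this example; the text does not specify how the residual is combined with the last layer's output.

```python
import numpy as np

def rnet_input(coarse, hull):
    """Stack four occupancy probability maps along a trailing channel axis.

    Channels: coarse shape, visual hull, voxels in the coarse shape but not
    in the hull, and voxels in the hull but not in the coarse shape.
    """
    outside_hull = coarse * (1.0 - hull)   # likely false alarms
    missed = hull * (1.0 - coarse)         # candidates for missed voxels
    return np.stack([coarse, hull, outside_hull, missed], axis=-1)

def apply_residual(coarse, delta):
    """Add a predicted correction to the coarse shape (assumed form).

    Clipping keeps the result a valid probability map; this detail is an
    assumption and not taken from the text.
    """
    return np.clip(coarse + delta, 0.0, 1.0)
```

For a voxel that the coarse shape marks as occupied but that lies outside the hull, the third channel is close to 1, giving the refinement network an explicit cue that the voxel may be a false alarm.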
We now present our training strategies, including the training pipeline for the sub-networks and their training losses.
Training pipeline. We employ a three-step network training algorithm to train the proposed network. Specifically, we first train V-Net, S-Net and P-Net separately, with input training images and their ground-truth shapes, silhouettes and poses. After V-Net converges, we train R-Net independently, with the predicted voxel occupancy probability from V-Net and the ground-truth visual hull, which is constructed from ground-truth silhouettes and poses via the PSVH layer. The goal is to let R-Net learn how to refine coarse shape predictions with ideal, error-free visual hulls. In the last stage, we finetune the whole network, granting the subnets more opportunity to cooperate. Notably, the R-Net will adapt to input visual hulls that are subject to estimation error from S-Net and P-Net.
Training loss. We use the binary cross-entropy loss to train V-Net, S-Net and R-Net. Concretely, let $p_n$ be the estimated probability at location $n$ in the output of V-Net, S-Net or R-Net, then the loss is defined as
$$ l = -\frac{1}{N} \sum_{n} \left( p_n^* \log p_n + (1 - p_n^*) \log (1 - p_n) \right) $$
where $p_n^*$ is the target probability (0 or 1) and $N$ is the number of locations. $n$ traverses over the 3D voxels for V-Net and R-Net, and over 2D pixels for S-Net. The P-Net produces a 6-D pose estimate as described before. We use the regression loss to train the network:
where the Euler angles are normalized into . We found in our experiments the loss produces better results than an loss.
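For reference, the binary cross-entropy loss used for V-Net, S-Net and R-Net can be sketched in NumPy as below. Averaging over locations (rather than summing) and the clipping constant `eps` are assumptions made for numerical convenience in this example.

```python
import numpy as np

def binary_cross_entropy(p, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities p and 0/1 targets.

    Works for any array shape, e.g. a 3D voxel grid or a 2D pixel map.
    """
    p = np.clip(np.asarray(p, dtype=np.float64), eps, 1.0 - eps)
    target = np.asarray(target, dtype=np.float64)
    per_location = target * np.log(p) + (1.0 - target) * np.log(1.0 - p)
    return float(-np.mean(per_location))
```

A maximally uncertain prediction of 0.5 everywhere gives a loss of log 2 regardless of the targets, which is a convenient sanity check when wiring up training.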
Our network is implemented in TensorFlow. The input image size is and the output voxel grid size is . A batch size of 24 and the ADAM solver are used throughout the training. We use a learning rate of for S-Net, V-Net and R-Net and divide it by 10 at the 20K-th and 60K-th iterations. The learning rate for P-Net is and is divided by 10 at the 60K-th iteration. When finetuning all the subnets together the learning rate is and is divided by 10 at the 20K-th iteration.
Training and testing data. In this paper, we test our method on four common object categories: car and airplane as the representative vehicle objects, and chair and couch as furniture classes. Real images that come with precise 3D shapes are difficult to obtain, so we first resort to the CAD models from the ShapeNet repository [Chang et al.2015]. We use the ShapeNet object images rendered by [Choy et al.2016] to train and test our method. We then use the PASCAL 3D+ dataset [Xiang, Mottaghi, and Savarese2014] to evaluate our method on real images with pseudo ground truth shapes.
The numbers of 3D models for the four categories are 7,496 for car, 4,045 for airplane, 6,778 for chair and 3,173 for couch, respectively. In the rendering process of [Choy et al.2016], the objects were normalized to fit in a radius-0.5 sphere, rotated with random azimuth and elevation angles, and placed in front of a 50-degree FOV camera. Each object has 24 images rendered with random poses and lighting.
Following [Choy et al.2016], we use 80% of the 3D models for training and the remaining 20% for testing. We train one network for all four shape categories until the network converges. The rendered images have clean backgrounds (uniform colors). During training, we blend half of the training images with random crops of natural images from the SUN database [Xiao et al.2010]. We binarize the output voxel probabilities with a threshold and report Intersection-over-Union (IoU).
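The evaluation metric can be sketched as follows: binarize the predicted occupancy probabilities and compute IoU against the ground-truth voxel grid. The threshold is left as a required argument because its value is not given here, and returning 1.0 when both grids are empty is a convention assumed for this example.

```python
import numpy as np

def voxel_iou(pred_prob, gt_occupancy, threshold):
    """Intersection-over-Union between a thresholded prediction and ground truth."""
    pred = np.asarray(pred_prob) > threshold
    gt = np.asarray(gt_occupancy).astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both grids empty; assumed convention
    return float(np.logical_and(pred, gt).sum() / union)
```

For instance, with predictions `[0.9, 0.8, 0.1, 0.6]`, ground truth `[1, 0, 0, 1]` and an illustrative threshold of 0.5, two voxels are in the intersection and three in the union, so the IoU is 2/3.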
| Method (IoU) | car | airplane | chair | couch | Mean |
| Refine. w. GT | 0.869 | 0.701 | 0.592 | 0.741 | 0.726 |
| Refine. w/o 2 prob. maps | 0.840 | 0.610 | 0.549 | 0.701 | 0.675 |
| Refine. w/o end-to-end | 0.822 | 0.593 | 0.542 | 0.677 | 0.658 |
| [Fan, Su, and Guibas2017] | 0.831 | 0.601 | 0.544 | 0.708 | 0.671 |
Quantitative results. The performance of our method evaluated by IoU is shown in Table 1. It shows that the results after refinement (i.e., our final results) are significantly better, especially for airplane and chair where the IoUs are improved by about 16% and 10%, respectively. Note that since our V-Net is adapted from [Choy et al.2016] as mentioned previously, the results before refinement can be viewed as the 3D-R2N2 method of [Choy et al.2016] trained by us.
To better understand the performance gain from our visual hull based refinement, we compute the IoU of the coarse and refined shapes for each object from the four categories. Figure 5 presents the comparisons, where the object IDs are uniformly sampled and sorted by the IoUs of coarse shapes. The efficacy of our refinement scheme can be clearly seen. It benefits the shape reconstruction for most of the objects, even though none of them was seen during training.
We further compare the numerical results with PointOutNet [Fan, Su, and Guibas2017], which was also evaluated on this rendered dataset and used the same training/testing lists as ours. Table 1 shows that our method outperformed it on three of the four categories (car, airplane and chair) and obtained a higher mean IoU over the four categories. Note that the results of [Fan, Su, and Guibas2017] were obtained by first generating point clouds using their PointOutNet, then converting them to volumetric shapes and applying another 3D CNN to refine them.
Table 3 compares the results of our method on test images with clean backgrounds and those blended with random real images. It shows that with random real images as backgrounds the results are only slightly worse. Table 4 shows the quality of the pose and silhouette estimated by P-Net and S-Net.
Qualitative results. Figure 4 presents some visual results from our method. It can be observed that some object components especially thin structures (e.g. the chair legs in the second and fifth rows) are missed in the coarse shapes. Moreover, we find that although some coarse shapes appear quite realistic (e.g. the airplanes in the left column), they are clearly inconsistent with the input images. By leveraging the single-view visual hull for refinement, many shape details can be recovered in our final results, and they appear much more consistent with the input images.
We also compare our results qualitatively with MarrNet [Wu et al.2017], another state-of-the-art single-view 3D object reconstruction method. (We were not able to compare the results quantitatively: MarrNet directly predicts the shapes in the current camera view, which are not aligned with GT shapes; moreover, the training and testing splits for MarrNet are not disclosed in [Wu et al.2017].) The authors released a MarrNet model trained solely on the chair category of the ShapeNet objects. Figure 6 presents the results on four chair images, where the first/last two are relatively good results from MarrNet/our method cherry-picked among 100 objects on our test set. It can be seen that in both cases, our method generated better results than MarrNet. Our predicted shapes are more complete and consistent with the input images.
We now evaluate our method on real images from the PASCAL 3D+ dataset [Xiang, Mottaghi, and Savarese2014]. This dataset only has pseudo ground-truth shapes for real images, which makes it very challenging for our visual hull based refinement scheme. Moreover, the provided object poses are noisy due to the lack of accurate 3D shapes, making it difficult to train our pose network.
To test our method on this dataset, we finetune our network trained on ShapeNet objects on images in PASCAL 3D+. We simply set the focal length to the same fixed value for all images since no focal length is provided. With this fixed focal length, we recomputed the object distances using the image keypoint annotations and the CAD models through reprojection error minimization. Due to space limitations, more details are deferred to the supplementary material.
Quantitative results. The quantitative results of our method are presented in Table 5 and Table 6. As can be seen in Table 6, the pose and silhouette estimation errors are much higher than the results on the ShapeNet objects. Nevertheless, Table 5 shows that our visual hull based refinement scheme still largely improved the coarse shape from V-Net for the car, airplane and couch categories. Note again that our V-Net is almost identical to the network in the 3D-R2N2 method [Choy et al.2016]. The refinement only yields marginal IoU increase for the chair category. We observed that the chair category on this dataset contains large intra-class shape variations (yet only 10 CAD shapes as pseudo ground truth) and many instances with occlusion; see the suppl. material for more details.
Qualitative results. Figure 7 shows some visual results of our method on the test data. It can be seen that the coarse shapes are noisy or contain erroneous components. For example, possibly due to the low input image quality, the coarse shape prediction of the car image in the second row of the left column has a mixed car and chair structure. Nevertheless, the final results after the refinement are much better.
Performance of refinement without visual hull. In this experiment, we remove the probabilistic visual hull and train R-Net to directly process the coarse shape. As shown in Table 1, the results are slightly better than the coarse shapes, but lag far behind the results refined with visual hull.
Performance of refinement with GT visual hull. We also trained R-Net with visual hulls constructed by ground-truth poses and silhouettes. Table 1 shows that the performance is dramatically increased: the shape IoU is increased by up to 30% from the coarse shape for the four object categories. The above two experiments indicate that our R-Net not only leveraged the visual hull to refine shape, but also can work remarkably well if given a quality visual hull.
Effect of the two additional occupancy probability maps fed to R-Net. The results in Table 1 show that, if these two additional maps are removed from the input of R-Net, the mean IoU drops slightly from 0.680 to 0.675, indicating our explicit knowledge embedding helps.
Effect of end-to-end training. Table 1 also presents the result without end-to-end training. The clear performance drop demonstrates the necessity of our end-to-end finetuning which grants the subnets the opportunity to better adapt to each other (notably, R-Net will adapt to input visual hulls that are subject to estimation error from S-Net and P-Net).
Performance w.r.t. pose and silhouette estimation quality. We find that the performance gain from refinement decreases gracefully w.r.t. rotation estimation error, as shown in Fig. 8 (left). One interesting phenomenon is that, as shown in Fig. 8 (middle), the best performance gain does not come from the best silhouette estimates. This is because larger and fuzzier silhouette estimates may compensate for the 2D-to-3D correspondence errors arising from noisy pose estimates.
Performance w.r.t. rotation angle. Figure 8 (right) shows that the performance gain from refinement is high at 30 and 330 degrees while low near 0 and 180 degrees. This is easy to understand, as frontal and rear views exhibit more self-occlusion and thus the visual hulls are less informative.
Sensitivity w.r.t. focal length. We conducted another experiment to further test our method under wrong focal lengths and distorted visual hulls. The results indicate that our method still works well with some weak-perspective approximations and the results are insensitive to the real focal lengths especially for reasonably-distant objects. The details can be found in the suppl. material.
For a batch of 24 input images, the forward pass of our whole network takes 0.44 seconds on an NVIDIA Tesla M40 GPU, i.e., our network processes one image in about 18 milliseconds on average.
We have presented a novel framework for single-view 3D object reconstruction, where we embed the perspective geometry into a deep neural network to solve the challenging problem. Our key innovations include an in-network visual hull construction scheme which connects the 2D space and pose space to the 3D space, and a refinement 3D CNN which learns shape refinement with visual hulls. The experiments demonstrate that our method achieves very promising results on both synthetic data and real images.
Limitations and future work. Since our method involves pose estimation, objects with ambiguous poses (symmetric shapes) or objects that do not even have a well-defined pose system (irregular shapes) will be challenging. For the former cases, using a classification loss to train the pose network would be a good remedy [Su et al.2015], although this may render the gradient backpropagation problematic. For the latter, one possible solution is resorting to multi-view inputs and training the pose network to estimate relative poses.
European Conference on Computer Vision Workshop on Geometry Meets Deep Learning, 236–250.
SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 3485–3492.
3D-PRNN: Generating shape primitives with recurrent neural networks. In ICCV, 900–909.