With growing demand from applications such as robot-object manipulation and 3D printing, automatic and efficient 3D model reconstruction from 2D images has recently become a hot topic in computer vision. Classic 3D reconstruction methods based on Structure-from-Motion [Halber and Funkhouser2017, Snavely et al.2006] are usually limited by illumination conditions and surface textures, and they require dense views. In contrast, learning-based methods [Liu et al.2017, Yan et al.2016] benefit from prior knowledge and can reconstruct a plausible result from a small number of images without assumptions on object reflectance or surface texture.
Regarding object reconstruction from a single image as a predictive and generative problem, learning-based methods usually use a CNN-based encoder and decoder to predict a 3D volume [Girdhar et al.2016, Dai et al.2017, Wu et al.2016b, Kar et al.2015], a 3D structure [Wu et al.2016a], or a point set [Fan et al.2017], trained with 3D supervision. Recent work [Yan et al.2016, Tulsiani et al.2017] attempts to train image-based object reconstruction models using only 2D images as supervision. For example, Yan et al. [2016] propose Perspective Transformer Nets with a novel differentiable projection loss that enables 3D learning from 2D projections without 3D supervision. Tulsiani et al. [2017] study the consistency between a 3D shape and 2D observations and propose a differentiable formulation to train 3D prediction from 2D observations.
A crucial assumption in the above-mentioned models, however, is that the input images contain most of the information about a 3D object. As a result, these models fail to make a reasonable prediction when the observation suffers from severe self-occlusion, because they lack information from other views. An effective solution is to use more views to supply the missing information. Choy et al. [2016] propose a 3D Recurrent Neural Network (3D-R2N2) to map multiple random views to their underlying 3D shapes. In contrast to 3D-R2N2, we focus on how much information is sufficient and how to aggregate this information for 3D object reconstruction. In other words, how many views, and which views, capture the most informative features and maximize the quality of reconstruction? This is a problem of dynamic view prediction during reconstruction, that is, an active process of capturing new information for 3D reconstruction.
Many methods seek the maximal information gain, for example through the decrease of uncertainty [Xu et al.2015a], Monte Carlo sampling [Denzler and Brown2002], and Gaussian Process Regression [Huber et al.2012]. Some methods treat this information gain task as a sequential prediction of the next best view (NBV), reducing scanning effort [Wu et al.2014] or the uncertainty of the object [Wu et al.2015] with the fewest observations. All of these attempt to obtain the maximal information gain with the minimal number of views. Intuitively, however, the best performance does not necessarily come from the maximal information. The reason is perhaps that learning-based methods attempt to exploit the spatial and temporal structure of the sequential observations [Xu et al.2016] and learn to predict an approximate view sequence that minimizes the deviation between the prediction and the ground truth. The attention model [Mnih et al.2014] is a good solution to the problem of sequential location prediction: it uses a recurrent neural network to extract information from a single image or an image sequence and adaptively selects a sequence of discriminative regions or locations. The attention model has been applied successfully to image classification [Xiao et al.2015], image captioning [Xu et al.2015b], and 3D shape recognition [Xu et al.2016].
However, unlike image identification, which measures the inconsistency between predicted category labels and the ground truth, object reconstruction requires a dense prediction for each voxel. It thus needs to explore a deeper relation between the 3D volume and 2D images and to use this relation to guide the aggregation of multi-view information and the planning of sequential views. To achieve this, our method differs from other attention-based models in two major aspects. First, to constrain the consistency between the 3D volume and 2D images, we combine volumetric and projective supervision in the process of view aggregation. Second, for guided view planning, our reward is based on reconstruction performance and volume-projection consistency, which helps the view planner capture more information.
Our experiments show that our model aggregates more discriminative information from multi-view images and that its accuracy clearly increases with the number of views. For the view planning task, we demonstrate that our view sequences give better predictions than the other strategies tested. The main contributions of this paper are as follows: (1) We build a Recurrent Encoder-Decoder based on multiple Conv-RNN layers and a volume-projection supervision, leading to better reconstruction performance. (2) We combine 3D volume prediction and 2D projection to design the reward for view planning policy learning. Under the control of the combined reward, we implicitly learn the deep relationship between 3D reconstruction and 2D images and optimize the planning policy. (3) We propose an active framework that learns a view planner for 3D object reconstruction. Our model dynamically determines view selection based on information gain and discrimination, which makes the reconstruction more accurate.
Figure 1 illustrates our model, which is summarized as follows. In the training stage, starting at a random view on the viewing circle (Figure 1 (a)), we select the pre-rendered color image whose associated view is closest to the random view and feed it into a Recurrent 2D Encoder, which encodes the image into a latent unit and propagates the information as it absorbs new views. The encoded unit is then fed into two branches. One is a Recurrent 3D Decoder, which maps the latent unit extracted from all past views to a predicted 3D volume. The other, called the View Planner, serves as a dynamic view prediction module that continually receives and integrates the encoded units from the current and all past views (Figure 1 (b)) and regresses the view parameters for the next observation (Section 2.2). To combine view planning and object reconstruction into a unified and correlated model, we propose a volume-projection guidance (Section 2.3) for the supervised learning of view-based volume mapping and the reinforcement learning of continuous view prediction. The volume comes from the Recurrent Encoder-Decoder, and the projection is generated by a differentiable Perspective Transformer [Yan et al.2016]. In the test stage, our model keeps acquiring images of a target object under the guidance of the View Planner and reconstructs the 3D model with the Recurrent Encoder-Decoder.
2.2 Network Architecture
We consider multi-view volumetric reconstruction as a dense prediction from a sequence of views and develop a unified framework for both volume reconstruction and view planning in an active scheme. The procedure can be divided into multiple time steps and we plot one step of data flow in Figure 2. Next we discuss the detailed architecture.
Recurrent 2D Encoder. This network extracts features from the input image and aggregates them with the past views into a latent unit, which is a function of the current image and of all past hidden-layer states of the encoder network. In our implementation, we build the Recurrent 2D Encoder from multiple 2D Conv-GRU layers to extract spatial features from images and integrate the sequential past states. 2D Conv-GRU layers update their hidden states under the control of three convolutional gates, with the hidden states arranged in 2D space. Compared to auto-encoder networks built from convolutional layers with a single 3D-GRU, our network is better at feature extraction on image sequences, as shown by a large improvement in reconstruction performance (demonstrated in Section 3.4).
Recurrent 3D Decoder. Taking the features extracted by the encoder network as input, the Recurrent 3D Decoder uses multiple 3D Conv-GRU layers, which are similar to 2D Conv-GRU layers but arranged in 3D space, to decode a 3D volume in which each voxel holds a probability of occupancy. The predicted volume is a function of the latent unit and of all past hidden-layer states of the decoder network. To increase the resolution of the feature maps, we add a voxel shuffle layer after each 3D Conv-GRU layer, which redistributes the depth dimension of the feature vector into 3D space.
View Planner. The task of view planning is to actively regress a sequence of views, parameterized as camera azimuth angles on a viewing circle around the object. Taking a random angle as the initial view, we sequentially feed the image rendered under the current view into the Recurrent 2D Encoder to extract and aggregate features. We then merge the features with the current view parameters by element-wise multiplication to get a viewing “glimpse” [Mnih et al.2014], which fuses the information of both the image sequence and the current view. With a GRU layer, our model retains all past glimpse information and continuously absorbs new views. The accumulated glimpse information discriminatively describes the relation among the image sequence, the 3D volume, and the parameters of the view sequence. Feeding the states to an extra fully connected layer after the GRU layer, we finally predict the parameters of the next view to get the next input image.
We use the Perspective Transformer Network proposed by Yan et al. [2016] to obtain a 2D projection from the 3D volume. Using this differentiable 3D transformation, we project the 3D voxel prediction onto a 2D grid, which resembles a projected silhouette. Combining this differentiable 2D projection with the predicted volume, we build a projective guidance for the training of volume prediction and view planning (see Section 2.3 for details).
2.3 Volume-Projection Guidance
We combine the procedures of object reconstruction and view planning into a unified framework. The view planning module optimizes a view prediction policy under the control of feedback signals based on the evaluation of reconstruction performance (as shown in Figure 1 (b)). For object reconstruction, the Recurrent Encoder-Decoder receives input images from the informative views predicted by the View Planner, which ensures sufficient information gain and improves reconstruction performance. We jointly train the two modules with both volumetric and projective signals but use different strategies: reinforcement learning under the control of a volume-projection reward, and supervised learning using volume-projection supervision.
Volume-Projection Reward. At each time step, the guidance for the View Planner comes from the quality of the volume predicted by the reconstruction module. In other words, the View Planner receives a reward signal built upon the reconstruction feedback from the Recurrent Encoder-Decoder. We only calculate the cumulative reward over a whole episode to update the view planning policy, which maps the image observation to the camera view. To accommodate the dense prediction over the whole 3D volume and to ensure that the reconstruction improves as new views are fed in, the increment of the voxel Intersection-over-Union (IoU) is used to measure the reconstruction reward. That is, the reward at step t is the increase, relative to the previous step, in the voxel IoU between the 3D volume predicted by the Recurrent Encoder-Decoder and the corresponding ground truth in the dataset.
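The IoU-increment reward described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the flat-list grid layout, and the 0.5 binarization threshold are assumptions made for this example.

```python
def voxel_iou(pred, truth, threshold=0.5):
    """IoU between predicted occupancy probabilities and a binary ground truth.

    Both grids are flat sequences of equal length (hypothetical layout).
    """
    inter = 0
    union = 0
    for p, g in zip(pred, truth):
        occupied = p > threshold   # binarize the prediction
        solid = g > 0.5            # ground truth is already 0 or 1
        inter += occupied and solid
        union += occupied or solid
    # Two empty grids are treated as a perfect match.
    return inter / union if union else 1.0


def reconstruction_reward(prev_pred, curr_pred, truth, threshold=0.5):
    """Reward for one step: the IoU gained by adding the newest view."""
    return (voxel_iou(curr_pred, truth, threshold)
            - voxel_iou(prev_pred, truth, threshold))
```

A view that does not improve the prediction yields a reward of zero, and a view that degrades it yields a negative reward.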
To implicitly learn the relation between the 3D volume and its 2D projections, we design a projection reward that encourages consistency between the 3D reconstruction and the 2D projections from different viewpoints. The projection reward is defined as the increment of the pixel IoU on 2D silhouettes sampled by the Perspective Transformer from multiple different views. The reward at step t is computed over a fixed number of projection views.
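A rough sketch of a silhouette-based projection reward is given below. Several choices are assumptions made only for illustration: an axis-aligned orthographic projection stands in for the Perspective Transformer, the grid is a cubic nested list indexed as `volume[x][y][z]`, and the per-view IoU gains are averaged.

```python
def silhouette(volume, axis, threshold=0.5):
    """Orthographic silhouette: a pixel is on if any voxel along `axis` is occupied."""
    n = len(volume)  # cubic grid assumed

    def occupied(i, j, k):
        idx = [i, j]
        idx.insert(axis, k)  # k runs along the projection axis
        x, y, z = idx
        return volume[x][y][z] > threshold

    return [[any(occupied(i, j, k) for k in range(n)) for j in range(n)]
            for i in range(n)]


def pixel_iou(a, b):
    """IoU between two boolean silhouettes of equal size."""
    pairs = [(p, q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)]
    inter = sum(p and q for p, q in pairs)
    union = sum(p or q for p, q in pairs)
    return inter / union if union else 1.0


def projection_reward(prev_vol, curr_vol, truth_vol, axes=(0, 1, 2)):
    """Mean gain in silhouette IoU over the chosen projection axes."""
    gains = []
    for axis in axes:
        target = silhouette(truth_vol, axis)
        gains.append(pixel_iou(silhouette(curr_vol, axis), target)
                     - pixel_iou(silhouette(prev_vol, axis), target))
    return sum(gains) / len(gains)
```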
In addition, to penalize the selection of similar views, we add a movement cost defined as the minimum circle distance between the current location and the past views. The final reward integrates the reconstruction reward, the projection reward, and the movement cost, each scaled by its own weight.
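The movement term and the weighted combination can be sketched as below. The function names, the use of degrees, and the default weights are assumptions for illustration; the sign convention of the movement weight is left to the caller because the surviving text does not fix it.

```python
def circle_distance(a, b):
    """Shortest angular distance in degrees between two azimuths on the viewing circle."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def nearest_past_distance(current, past_views):
    """Minimum circle distance between the current view and all past views."""
    return min(circle_distance(current, v) for v in past_views)


def total_reward(r_rec, r_proj, move_term, w_rec=1.0, w_proj=1.0, w_move=1.0):
    """Weighted combination of the three terms (weights are hypothetical)."""
    return w_rec * r_rec + w_proj * r_proj + w_move * move_term
```

For example, a view at 350 degrees is only 20 degrees away from a past view at 10 degrees, because the distance wraps around the circle.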
Using this reward, we control the update of the policy with a policy gradient algorithm. At each time step, we sample the views predicted by the View Planner from a normal distribution with a predefined standard deviation, and we minimize a loss function that optimizes the view planning policy. The loss involves the log-probability of the view sampled at each time step, the reward at that step, and a predicted value that serves as a baseline to center the reward [Mnih et al.2014]. The gradient of the log-probability is obtained by backpropagation through the network, and the reward is the signal received from the reconstruction feedback of the Recurrent Encoder-Decoder.
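For readers unfamiliar with this family of losses, the sketch below shows a generic REINFORCE-style objective with a Gaussian policy and a baseline. It is an assumption about the general form, not the paper's exact equation, and it ignores the wrap-around of angles for simplicity; all names are hypothetical.

```python
import math


def gaussian_log_prob(sample, mean, std):
    """Log-density of `sample` under a normal distribution N(mean, std^2)."""
    z = (sample - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def policy_gradient_loss(samples, means, rewards, baselines, std):
    """Negative log-probability of each sampled view, weighted by the centered reward."""
    loss = 0.0
    for a, mu, r, b in zip(samples, means, rewards, baselines):
        loss -= gaussian_log_prob(a, mu, std) * (r - b)
    return loss
```

When the reward equals the baseline at every step, the loss is zero and the policy receives no update signal from that episode.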
Volume-Projection Supervision. The loss function of the Recurrent Encoder-Decoder is defined as the mean voxel-wise squared error (MSE) between the final output of the Recurrent Encoder-Decoder and the ground-truth volume.
Besides the 3D volumetric loss, we add a 2D projective loss to implicitly learn the effect of 2D projection on 3D prediction, which improves the multi-view reconstruction performance. The 2D supervision loss compares the projection of the predicted volume, produced by the Perspective Transformer Network under view parameters given as a transformation matrix, with the projection of the ground-truth voxels. We combine the 3D supervision (Equation 5) and the 2D supervision (Equation 6) using a weighted sum, with separate weights for the volumetric loss and the perspective loss.
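The weighted sum of the two supervision terms can be sketched as follows. The flat-list layout, the use of a mean squared error for the projection term, the averaging over views, and the default weights are all assumptions made for this example.

```python
def mse(pred, target):
    """Mean squared error between two flat sequences of equal length."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combined_loss(pred_volume, true_volume, pred_projections, true_projections,
                  w_vol=1.0, w_proj=1.0):
    """Weighted sum of the volumetric loss and the mean per-view projection loss."""
    loss_3d = mse(pred_volume, true_volume)
    per_view = [mse(p, t) for p, t in zip(pred_projections, true_projections)]
    loss_2d = sum(per_view) / len(per_view)
    return w_vol * loss_3d + w_proj * loss_2d
```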
In this section, we discuss the following three questions: (1) Can our model improve the accuracy of reconstruction with an increasing number of views? (Section 3.2) (2) Can our View Planner obtain more informative and discriminative views to boost the reconstruction performance compared to the other alternative methods? (Section 3.3) (3) Do our network structures learn better than other settings? (Section 3.4)
3.1 Implementation Details
Our model is trained and tested with the PyTorch framework, accelerated by a GPU (NVIDIA GTX 1080Ti). We use the dataset from [Yan et al.2016], which is based on ShapeNetCore [Wu et al.2015]. Each model is represented as a 3D volume of from its canonical orientation, and images are rendered from azimuth angles with elevation angle. For each rendered image, we cropped and resized the centered region to pixels with channels (RGB). We initialized all the weights using Xavier initialization [Glorot and Bengio2010] and updated the weights using the ADAM solver with batch size 16 and 200 epochs.
3.2 Evaluation on Reconstruction Performance
We compare our method with PTN (Perspective Transformer Network [Yan et al.2016]), OGN (Octree Generating Networks [Tatarchenko et al.2017]), and 3D-R2N2 (proposed by Choy et al. [2016]). PTN uses an encoder-decoder model to predict a 3D volume and is trained with a combined loss of projection supervision and volume supervision. OGN generates volumetric 3D outputs in a compute- and memory-efficient manner by using an octree representation. To evaluate the multi-view performance, we also compare to 3D-R2N2, which performs both single- and multi-view 3D reconstruction using a 3D recurrent network. We trained and tested our network on 13 categories with the train/test data split used by 3D-R2N2’s authors, which is adopted by OGN’s authors as well. For a fair comparison, we followed 3D-R2N2’s setting and used 5 random views along the view circle to evaluate our Recurrent Encoder-Decoder model. For PTN, we re-trained the model for multi-category reconstruction using the code released by the authors, since they originally trained their model only on the chair category. In the test stage, we compute the voxel IoU with a threshold of 0.4 as the evaluation metric.
Overall results. In Figure 3 (a), we plot the trend of the mean reconstruction IoU for the compared methods. Our method performs better than the baseline volumetric reconstruction methods PTN and OGN when using only a single view, and it outperforms them by a large margin as the number of views increases. This demonstrates that our model can predict a reasonable reconstruction from a single image. Compared to 3D-R2N2, we obtain significantly better reconstruction performance over the number of views, as confirmed by a two-sided t-test (p-value < 0.01). The reason is perhaps that our Recurrent Encoder-Decoder extracts more discriminative features and aggregates the information from different views at a deeper level.
Per-category results. We also examine the reconstruction performance of the compared methods on the 13 categories, as shown in the table of Figure 3 (b). Our model achieves higher IoUs as the number of views increases and performs best when using 5 views. We also observe that our model is consistently better than PTN and 3D-R2N2 in single-view reconstruction. This may be due to our volume-projection guidance.
Qualitative results. The reconstruction examples in Figure 5 show qualitatively that our model can generally predict a reasonable global shape of a 3D object even from a single view, and that, when given more information from different views, it succeeds in refining local details where 3D-R2N2 fails (marked by the red circles).
3.3 Evaluation on Information Gain
To evaluate the performance of view prediction, we compared our View Planner against two baselines and an alternative method. The two baselines are a random planner, which selects a random view as the next view, and a farthest planner, which selects the view farthest from the previous views on the viewing circle around the target object. The alternative method is the NBV technique proposed in 3D ShapeNets [Wu et al.2015], which estimates the information gain of a view from the 3D volume. We train and test our active reconstruction model on the chair models in ShapeNet, rendering images under the train/test data split used by PTN’s authors [Yan et al.2016]. We set , , . For comparison, we feed the image sequence predicted by each of these strategies into the pre-trained Recurrent Encoder-Decoder to reconstruct the 3D volume.
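The two baseline planners can be sketched as below. The candidate set (azimuths in 30-degree steps) and the function names are hypothetical and chosen only for the example; the actual view sampling used in the experiments is not recoverable from this text.

```python
import random


def circle_distance(a, b):
    """Shortest angular distance in degrees between two azimuths."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def farthest_view(past_views, candidates):
    """Pick the candidate whose nearest past view is as far away as possible."""
    return max(candidates,
               key=lambda c: min(circle_distance(c, v) for v in past_views))


def random_view(candidates, rng=random):
    """Pick any candidate uniformly at random."""
    return rng.choice(candidates)


# Hypothetical candidate azimuths, in degrees.
CANDIDATES = list(range(0, 360, 30))
```

Starting from a single view at 0 degrees, the farthest planner moves to the opposite side of the circle, then to one of the two views halfway between the views already taken.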
We plot the IoU values and the decrease of Shannon entropy over the number of views in Figure 4. Compared to the other methods, our model not only attains more information but also produces more accurate results, showing that it can predict a view sequence that is both informative and discriminative.
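One plausible way to measure the entropy of a predicted volume is sketched below. Treating every voxel as an independent Bernoulli variable and summing the per-voxel entropies is an assumption of this example, as are the function name and the use of bits.

```python
import math


def volume_entropy(probs):
    """Sum of per-voxel binary Shannon entropies (in bits) of an occupancy grid."""
    total = 0.0
    for p in probs:
        for q in (p, 1.0 - p):
            if q > 0.0:  # the limit of q * log(q) as q -> 0 is 0
                total -= q * math.log2(q)
    return total
```

Under this definition, a fully undecided voxel (probability 0.5) contributes one bit and a fully decided voxel contributes nothing, so the entropy drops as additional views make the prediction more confident.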
3.4 Network Structures Comparison
To demonstrate that our Recurrent Encoder-Decoder extracts more discriminative features from an image sequence and achieves better reconstruction performance, we compare four network architectures: a fully convolutional Encoder-Decoder with a 3D RNN (2E-R-3D), a Recurrent 2D Encoder with a 3D CNN-based decoder (R2E-3D), a 2D CNN-based Encoder with a Recurrent 3D Decoder (2E-R3D), and our Recurrent 2D Encoder with a Recurrent 3D Decoder (R2E-R3D). We trained all four models on the chair category using rendered images from random views and ground-truth 3D volumes, under the train/test data split of the ShapeNet database used by PTN’s authors. For comparison, we use 5 random views to reconstruct the 3D volume and report the MSE loss and IoU values in Table 1. The results show that our R2E-R3D architecture performs best on both training loss and testing IoU, which supports the advantage of our model in view-based reconstruction.
|Encoder|2D Enc|R-2D|2D Enc|R-2D|
|Decoder|3D Dec|3D Dec|R-3D|R-3D|
In this paper, we have presented a learning-based model with active perception that unifies guided information acquisition and multi-view object reconstruction. Under the guidance of both volume and projection, we jointly train the Recurrent Encoder-Decoder and the View Planner for active object reconstruction. Experiments demonstrate that our model obtains more information and improves reconstruction performance as the number of views increases. Our model extracts only semantic features and ignores the correspondence of geometric features across camera viewpoints, which leads to slow growth in performance when more than 5 views are fed in. In the future, we would like to use multi-modal features to optimize or jointly learn the object reconstruction and to use more efficient data representations to increase the output resolution. It would also be interesting to extend our approach to multi-object reconstruction by predicting the transformation of the camera view from one object to another.
We thank the anonymous reviewers for their insightful and constructive comments. The work was partially funded by the Research Grants Council of HKSAR, China (Project No. CityU 11237116 and CityU 11300615), ACIM-SCM, the Hong Kong Scholars Program, and by NSFC grants from the National Natural Science Foundation of China (No. 91748104, 61632006, 61425002, U1708263).
- [Choy et al.2016] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In European Conference on Computer Vision, 2016.
- [Dai et al.2017] Angela Dai, Charles Ruizhongtai Qi, and Matthias Nießner. Shape completion using 3d-encoder-predictor cnns and shape synthesis. In Computer Vision and Pattern Recognition, 2017.
- [Denzler and Brown2002] J Denzler and C Brown. An information theoretic approach to optimal sensor data selection for state estimation. Pattern Analysis and Machine Intelligence, 2002.
- [Fan et al.2017] Haoqiang Fan, Hao Su, and Leonidas Guibas. A point set generation network for 3d object reconstruction from a single image. In Computer Vision and Pattern Recognition, 2017.
- [Girdhar et al.2016] Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning a predictable and generative vector representation for objects. In European Conference on Computer Vision, 2016.
- [Glorot and Bengio2010] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
- [Halber and Funkhouser2017] Maciej Halber and Thomas Funkhouser. Fine-to-coarse global registration of rgb-d scans. In Computer Vision and Pattern Recognition, 2017.
- [Huber et al.2012] Marco F Huber, Tobias Dencker, Masoud Roschani, and Jürgen Beyerer. Bayesian active object recognition via gaussian process regression. In Information Fusion, 2012.
- [Kar et al.2015] Abhishek Kar, Shubham Tulsiani, Joao Carreira, and Jitendra Malik. Category-specific object reconstruction from a single image. In Computer Vision and Pattern Recognition, 2015.
- [Liu et al.2017] Xiaobai Liu, Yadong Mu, and Liang Lin. A stochastic image grammar for fine-grained 3d scene reconstruction. In International Joint Conferences on Artificial Intelligence, 2017.
- [Mnih et al.2014] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, 2014.
- [Snavely et al.2006] Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3d. ACM Transactions on Graphics, 2006.
- [Tatarchenko et al.2017] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In International Conference on Computer Vision, 2017.
- [Tulsiani et al.2017] Shubham Tulsiani, Tinghui Zhou, Alexei A. Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In Computer Vision and Pattern Recognition, 2017.
- [Wu et al.2014] Shihao Wu, Wei Sun, Pinxin Long, Hui Huang, Daniel Cohen-Or, Minglun Gong, Oliver Deussen, and Baoquan Chen. Quality-driven poisson-guided autoscanning. ACM Transactions on Graphics, 2014.
- [Wu et al.2015] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Computer Vision and Pattern Recognition, 2015.
- [Wu et al.2016a] Jiajun Wu, Tianfan Xue, Joseph J Lim, Yuandong Tian, Joshua B Tenenbaum, Antonio Torralba, and William T Freeman. Single image 3d interpreter network. In European Conference on Computer Vision, 2016.
- [Wu et al.2016b] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems, 2016.
- [Xiao et al.2015] Tianjun Xiao, Yichong Xu, Kuiyuan Yang, Jiaxing Zhang, Yuxin Peng, and Zheng Zhang. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In Computer Vision and Pattern Recognition, 2015.
- [Xu et al.2015a] Kai Xu, Hui Huang, Yifei Shi, Hao Li, Pinxin Long, Jianong Caichen, Wei Sun, and Baoquan Chen. Autoscanning for coupled scene reconstruction and proactive object analysis. ACM Transactions on Graphics, 2015.
- [Xu et al.2015b] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, 2015.
- [Xu et al.2016] Kai Xu, Yifei Shi, Lintao Zheng, Junyu Zhang, Min Liu, Hui Huang, Hao Su, Daniel Cohen-Or, and Baoquan Chen. 3d attention-driven depth acquisition for object identification. ACM Transactions on Graphics, 2016.
- [Yan et al.2016] Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In Advances in Neural Information Processing Systems, 2016.