This paper studies the structured prediction problem of regressing unordered point sets with implicit and often ambiguous input spaces. A concrete instance which embodies this type of problem is 3D object geometry reconstruction (3DR) from single-view images for partial shape guidance firman2016structured
. The ambiguity arises from the fact that 3D-to-2D mapping is not invertible and large portions of the object features are typically occluded. 3DR is a pivotal learning problem in visual understanding, with numerous applications across domains. For instance, an intelligent robot requires a 3D model of the object instance to reason about manipulation. Similarly, in augmented reality recognizing the 3D shapes of often unseen objects in the world is necessary for both correct rendering and interaction. 3DR has been explored in a large body of extant work in computer vision, for problems such as structure from motionsfm; slam or multiview stereo Furukawa_etal_2010; Furukawa_Ponce_2010; Goesele_etal_2010; Hernandez_Vogiatzis_2010; lhuillier2005quasi; agarwal2009building; engel2014lsd and at times even with single view images Criminisi00a. Ingenious work on “Shape-from-X" has utilized priors on natural images to infer geometric features, with “X" being shading, texture, specularity, shadow and so on zhang1999shape; malik1997computing; healey1988local; savarese20073d; BarronTPAMI2015. Most of the aforementioned methods require carefully constructed features, a problem that is addressed by data-driven methods relying on large-scale 3D object datasets DBLP:journals/corr/ChangFGHHLSSSSX15; xiang2016objectnet3d. Data-driven methods learn implicit priors for various object recognition tasks such as shape completion and 3D reconstruction. Broadly, these methods use the prior knowledge in two ways: (i) image-based shape retrieval that focuses on algorithm design to find the nearest shape in database for the query image xiang2014beyond; DBLP:journals/corr/ChangFGHHLSSSSX15; rock2015completing, and (ii) deep generative models which operate directly on the query image and generate a 3D reconstruction as output, matching the shape distribution but resulting in different shape instances than in the database choy20163d; fan2016point; rezende2016unsupervised; wu2016single; yan2016perspective; drcTulsiani17; wsgan
. We note that the majority of recent methods for 3DR resort to either direct volumetric representation (aka voxels) or meshes from multi-view images as their shape representation. While intuitive, these representations can be both computationally inefficient and ineffective in capturing the natural invariance of 3D shapes under geometric transformations. A recent method by Fan et al. uses Point Set Generation Network (PSGN) to alleviate these problems. As they note, a point cloud is a simple, uniform structure that is relatively easier to learn than voxels, as it does not have to encode multiple primitives or combinatorial connectivity patterns. Additionally, point clouds are computationally superior to voxels since they do not require 3D convolutions and are amenable to direct manipulation when it comes to shape deformation and transformation. However, though direct prediction of fixed size point clouds improves 3DR performance, giving up on connectivity can result in a lack of fine shape features since loss functions are focused on overall reconstruction. At the same time, Spatial Transformer Networks (STN) have presented an appealing method to learn geometric transformations in 2D imagesjadenberg2015stn. STNs are a modular, differentiable and dynamic upgrade to pooling operations that learn to zoom in and eliminate background clutter, thereby “standardizing” the input. However, they have primarily been studied in the context of discrete grids in image inputs to facilitate image classification. Inspired the ideas from PSG and STN, we propose DeformNet- a model that extends STN style geometric operations to 3D using the notion of Free-Form Deformations (FFD). When used in conjunction with a point cloud representation, this method not only benefits from computational efficiency but also can preserve fine details in shapes since it implicitly preserves connectivity in structures. DeformNet uses a single image input to first perform shape retrieval from an object dataset using a learned image-to-shape embedding, and then deforms the point cloud representation of the retrieved template using the FFD layer in an encoder-decoder style network architecture as depicted in Figure 1. DeformNet intuitively builds on the strengths of Shape Retrieval, PSG, and STNs while compensating for their shortcomings. We implement FFD as a differentiable layer for end-to-end training along with point set correspondence based loss functions Chamfer Distance and Earth-Mover’s Distance, as in fan2016point. To summarize, the main contributions of this paper are:
Introduction of Free Form Deformation as a differentiable layer to be used as a new building block that enables 3D data manipulation.
A novel framework, DeformNet based on FFD layer to achieve smooth geometric deformations in point clouds for 3D Reconstruction.
Evaluation of DeformNet on rendered images achieves state-of-the-art performance in comparison to both Point Prediction Methods such as PSG fan2016point and Generative Models such as 3D-R2N2 choy20163d.
2 Related Work
Generative Models for 3D Reconstruction.
Recently, generative models for 3D reconstruction have produced state-of-the-art results. One approach is targeting voxel reconstruction through a 3D voxel neural network.choy20163det al. tlnet proposed a TL-embedding network which embeds image and shape together for single view 3D volumetric shape generation. Wu et al. 3dgan proposed a 3D VAE-GAN which brings the two popular generative models together in volumetric shape generation and reconstruction. There are also representative advances in unsupervised/weakly supervised 3D learning for reconstruction. yan2016perspective proposed a perspective transformer net for reconstruction from a single image which only uses images contour as supervision. rezende2016unsupervised proposed a conditional generative network for unsupervised reconstruction from images. All of the above neural network based 3D reconstruction methods use voxel representations, which requires a large amount of memory and is inefficient due to the small space usage of generic 3D shapes. Fan et al. fan2016point
proposed an alternative approach with neural networks that output unordered 3D point sets for 3D reconstruction. In our work, we combine both voxel and point set representations by using a 3D convolutional neural network to generate a deformation that is applied to 3D points sets for output which can preserve fine detail in the template shape.Spatial Transformer Networks Conventional convolutional neural networks lack the ability to warp or select a patch from an image that is relevant to the target task, which leads to an unnecessarily larger and deeper network for larger images. To alleviate this issue, Jadenberg et al. jadenberg2015stn proposed the Spatial Transformer Networks (STN) that can apply geometric transformations to an input or activations such as Affine Transformation or Thin-Plane-Spline to extract a patch that is relevant to the task. Kanazawa et al. warpnet proposed a WarpNet, an unsupervised method for deforming an image using a Spatial Transformer Network. This work is similar in spirit to the WarpNet in that the neural network generates deformation parameters and the loss is defined using the deviation of the deformed input to the ground truth. However, unlike the WarpNet, we condition the network on two inputs (the reference image and a template shape) and use it for 3D shape deformation and reconstruction. 3D Shape Deformation Recently, Yumer and Mitra yumer2016learning proposed a 3D Convolutional Neural Network that generates a deformation field as an output for 3D mesh editing given a user input. We employ the same deformation representation, Free-Form Deformation (FFD) to generate 3D deformation field. The primary difference between our approach and that of Yumer et al. are that they supervise the network using precomputed deformation offsets while we propose an end-to-end trainable network which only requires the target shape as the sole source of supervision. We accomplish this by computing distance between a set of points and minimizing it with respect to the deformation field (Sec. 3.4). In addition, unlike Yumer et al., we focus on the 3D reconstruction given an image input, rather than 3D editing. Huang et al. huang2015single also proposed an approach to 3D reconstruction through template retrieval and deformation, which relies on jointly segmenting the 2D image and 3D templates and creating the 3D reconstruction from deformed parts of the segmented templates. Their approach differs from ours in that it relies on having noisy segmentation of the shapes in their database, in that it performs reconstruction by segmenting both the input image and templates and deforming template parts to fit with the parts of the segmented image via direct optimization, and in that it takes significantly longer to run due to the optimizations it employs. To summarize, voxelized representations suffer from memory inefficiency and difficulty in generating fine details, previous deformation networks require the limiting supervision of point displacements, and previous work on reconstruction through template deformation relied on noisy segmentation in the shapes database. Purely generative models don’t leverage the high quality and broad availability of shape databases. Our paper finds a balance between the flexibility of a generative approach and the output quality of a database-focused approach. It does this without requiring any supervision beyond that of the desired shape. This addresses the major problems with the current best methods.
In this section, we propose the DeformNet framework that generates a 3D shape reconstruction from a single image. Unlike recent single-view 3D reconstruction works that use Convolutional Neural Networks, we do not use voxelized output choy20163d; tlnet; 3dgan, which cannot recover fine details due to the coarse resolution of the voxel grid, and do not directly generate a point cloud from scratch fan2016point. Instead, we propose an end-to-end network that deforms a template 3D shape to match an input image and preserves the topology of the template shape and train it using the target shape only. This can be seen as combining the key insights from prior work that used Free-Form Deformation in a deep learning model yumer2016learning and the Chamfer distance for the objective function fan2016point. In the following subsections, we first introduce metric learning based shape template retrieval, which finds the most topologically similar 3D shape template for a given image (Sec. 3.1). Then, we introduce DeformNet, which takes a set of templates that we use for deformation and a query image and learns to output the deformation field end-to-end given the target shape (Sec. 3.3). Lastly, we present two objective functions that measure the deviation of the deformed shape from the target shape by finding correspondences on-the-fly (Sec. 3.4). See Figure 1 for illustration.
3.1 Shape Template Retrieval
To make use of the high-quality 3D CAD models in the existing database, the first step of the framework is to retrieve shape templates that have a similar topology to the object in a query image. For this, we use metric learning to learn an embedding that preserves topological similarity between shapes. Specifically, we first define the metric space by a set of constraints that force dissimilar shapes to be at least a margin father away than the distance between similar objects.
where is a margin, indicates a set containing all elements in the -th class and is an arbitrary distance function. We use a parametric metric space representation using a neural network where denote the parameters in the neural network. The above constraints can be converted to a loss function which forms a triplet loss schroff2015facenet. By minimizing the loss function with respect to , we can generate a feature representation that preserves perceptual similarity in a metric space where the distance operation is meaningful. However, due to difficulty in mining hard negatives and inefficiency of not reusing features in a batch, the naive metric learning approach only yields a moderate result schroff2015facenet; oh2016deep. Instead, we use the smoothed version of the triplet loss that reuses all features in a batch for efficient hard negative mining oh2016deep and allows fast and effective training.
where indicates and denotes a set of shapes that belong to different categories from the category of . Specifically, we used a 2D CNN to implement and denotes rendering of the -th shape. We define the positive pairs to be renderings of the same shape from different perspectives and negatives to be renderings from different shapes. We define the similarity to be the inverse of the distance in the metric space and retrieve -nearest neighbors from a query image and use the shapes that generated the images for the next stage.
3.2 DeformNet: Model
Given a reference image and a shape template that closely matches the object in the input image from the Shape Retrieval stage (Sec. 3.1), we want to generate the parameters of a deformation which transforms the shape template into the shape in the reference image. Unlike conventional CNN for reconstruction, DeformNet takes two different modalities as inputs and thus has two CNNs in these respective domains to encode the inputs. First, for the 2D image, we used 2D CNN for an image encoder and for the 3D shape input, we voxelized sampled points on the surface to generate dense point cloud and voxelized to feed into a 3D CNN encoder . We simply combined the information from two modes by stacking the final fully connected layer activation to the last 3D CNN activation along the channel so that we preserve the spatial information from the 3D shape template. The combined information from both encoders contains all information that we need to know from both inputs and thus call the latent variable . is then fed into a 3D decoder , a 3D Deconvolutional Neural Network, to expand the spatial dimension of the output. Since the 3D encoder loses spatial details as we project the activations from 3D convolution layers to a coarser voxel grid, we use a 3D U-Net structure which is an extension of the 2D U-Net proposed in yumer2016learning to recover the details in the output. The U-net has an hourglass shape and skip connections between the 3D encoder and 3D decoder that crosses the latent variable (Fig. 1
). The final output of the decoder is a vector field, which is used as the offset for the control points in the Free-Form Deformation Layer (Sec. 3.3). Each offset , represents offset of the corresponding control point. These values are used to then compute the deformed point cloud output, which is the final output of the network, and which can then be compared against the ground truth point cloud of the input image shape.
3.3 Free-Form Deformation Layer
Free-Form Deformation (FFD) sederberg1986free is the 3D extension of a Bezier curve form, which has been widely used for shape deformation. Since it is defined on a 3D grid, FFD fits with 3D convolutional neural network and has been used for generating 3D deformation using a neural network beforeyumer2016learning. The DeformNet learns the FFD for every input image - shape template pair, and the predicted FFD on shape template will generate the final deformed shape. In this paper, we mainly focus on manipulating and evaluating reconstructions with point clouds, though the learned FFD could also be applied to other formats. FFD is formally formulated as following. Let be the normalized point coordinate in the grid and be the 3D deformation offset at the grid control point . Then, the point after deformation is defined as
where is a binomial function, and
are sizes of the control point grid. The interpolation will mix the displacement of all control points around each data point, generating a smooth deformed output. The 3D decoder outputs deformation field. Note that is differentiable with respect to , which guarantees that the backward propagated gradients could flow from the objective function on the top of FFD, making FFD learnable.
3.4 Objective Functions
To make the DeformNet end-to-end trainable, we need to define a loss function that optimizes the target task: deforming a template shape to match a target shape. Ideally, the function should measure difference between deformed shape and a template shape and should return 0 if and only if the deformed shape overlaps the template exactly. However, 3D shapes are defined by vertices and faces whose accurate topological similarity is difficult to measure. So, rather than measuring the topological similarity, we sample points on the surface of a shape and use the set of points (point cloud) as a surrogate for a shape. Point clouds are easy to manipulate due to their simplicity and efficiency, and we follow fan2016point by using distance functions on point clouds as the loss function. If the network generates the correct deformation field that minimizes the point cloud distances, then the deformed point could will match the target point cloud accurately as well. We explore two point cloud distance functions: Earth Mover’s distance (EMD) and Chamfer distance (CD) rubner2000earth. Both distance measures are based on point-wise L-2 distance. Earth Mover’s distance: The EMD is defined as the minimum of sum of distances between a point in one set and a point in another set over all possible permutation of correspondences. To find the minimum, the EMD implicitly solves bipartite matching problem. More formally, given two sets of points , the EMD is defined as
where is some bijection from to . Chamfer distance: CD is computationally easier than EMD, since it uses sub-optimal matching to determine the pair-wise relation. For each point in one point set, CD simply treat the nearest neighbor in the other set as the image of this point. Formally, CD is defined as
We use two forms of regularization in addition to the distance function, the first being L1 loss over all point cloud offsets to force the network to deform the template as little as possible, and the second being L2 loss over the difference between neighboring control point offsets to promote smooth deformation.
3.5 Implementation Details
For shape retrieval, we use the multiview rendered images of shape from choy20163d to train the network and used the GoogLeNet model szegedy2015going
. At training and testing time, the template shape input is chosen at random from the 5 most similar shapes from the same category. For EMD, we precomputed ground truth correspondences between each input and its 5 most similar shapes using the Hungarian Algorithm; thus it should be noted that the correspondence is based on the undeformed template, which is an approximation of true EMD. We evaluated using true EMD loss and found it to not work better, and as discussed in the appendix focused on using Chamfer distance as the loss for training. For Chamfer distance, we use the output of the network directly to compute the distance. We used TensorFlow to implement the networks and used Adam optimizer with momentum term of 0.95. The model is trained with an initial learning rate of 5e-4 and goes down to 5e-5 after 20k iterations. After selection based on performance, we choose a batch size of 16 for training. We use leaky ReLU as an activation function. Note that to make use of EMD, we resample the point clouds to normalize the number of points. We train on point clouds with 1024 points. For deformation, we set N=4 as the number of control points in each dimension that points are computed with (so each point in the deformed output point cloud is a function ofcontrol points). We use for regularization on control point offsets.
4 Experimental Evaluation
4.1 Experimental Setup
|Ours||Rtrv||No Rtrv||PSGNfan2016point||3D-R2N2choy20163d||Ours||Rtrv||No Rtrv||PSGNfan2016point||3D-R2N2choy20163d|
We train and evaluate our models on the ShapeNet database shapenet, which contains a large quantity of manually created and cleaned 3D CAD models. Specifically, we select 5 representative categories to study on: chair, car, airplane, bench and sofa, following Gwak et al. wsgan. The images for training and testing are rendered in various angles to provide synthetic training data for the model. In total, 22,324 shape models are covered, where training/testing split is 80%/20%. The 3D CAD objects are originally stored as meshes, so we enriched the dataset via resampling the meshes into point clouds and voxelizing them into voxels.
4.2 3D Shape Reconstruction from RGB Images
We compare our work with point set generation network (PSGN) fan2016point, which is the state-of-the-art in deep learning based 3D reconstruction from a single image. PSGN chooses point clouds as the 3D representation, which allows manipulation including geometric transformations and deformations. Also, point clouds can in principle contain more information than voxel representations due to the latter’s discrete nature, and point clouds are easy to convert into voxels whereas the other way around will be tricky. Therefore we also target point clouds, though the learned free-form deformation can be applied to both voxel, point cloud, and mesh. On point clouds, PSGN proposed two metrics for training and evaluating the reconstruction - CD and EMD. To have a fair comparison, we use the same point-set based metrics and follow their experiment setups. We train and test with relatively sparse point clouds with 1024 points, though as will be demonstrated in section 4.6 our model can be applied to dense point clouds despite being trained with sparse ones. To have comparable scaling of distances, during evaluation point clouds are bounded in a hemisphere with a radius of unit 1 and are aligned to their ground plane. Unit 1 is defined as 1/10 of the length of the 3D grid as done in PSGN. We train our models on all five categories with only the CD the loss (as it was found that EMD provided no benefit over CD), and provide both CD and EMD metrics on the test set alongside the same metrics for the newest trained model released by PSGN. Here we report the per category comparison on both CD and EMD metrics in Table 1. In PSGN, they include the mean value of point-set based metrics from 3D-R2N2 choy20163d, thus we also list them here. On CD metrics, we outperform PSGN and 3D-R2N2 on all categories; on EMD, PSGN achieves better values on the car and sofa categories, but our model performs significantly better in the other three categories as well as the average value. This indicates PSGN has good performance on rotund and less detailed objects, while our strength is on objects with more fine-grained details such as chairs and airplanes. Both our work and PSGN focus on reconstruction on point clouds, where the point-set based metrics are straightforward and intuitive. To contrast the two methods clearly, we show the visual comparison in Figure 2. In general, our reconstruction can recover the main features from the object, even when the input template is not ideal for the image. This can be attributed to the combined benefits of shape retrieval and deformation - shape retrieval alone provides very good complete 3D shape templates without missing features, and deformation is able to preserve all of the template’s main features while tweaking them to more closely match their shape in the image input.
4.3 Ablation Analysis
Sensitivity to input shape template. Image-based shape retrieval could provide reasonable CAD model as template to start deforming with. We therefore also provide analysis on the degree to which DeformNet relies on having a good shape template.
To do this analysis, we compute the statistics on test set of all categories: for each group of input template (In), output (Out), ground truth shapes, we compute the tuple . Figure 3 shows the scattering plot of 516 random groups. Most of them lie closely to the horizontal axis, showing that DeformNet is not very sensitive to the input template’s quality, which indicates the robustness of DeformNet. We also report the CD and EMD with randomly picked shape template as input (without shape retrieval) in 2, averaged on all categories. Note that DeformNet without shape template retrieval still outperforms PSGN on average. Joint skip connections. The joint skip connections stacking activations from 3D encoder and 3D decoder serve as information conveyors on the same level of spatial resolution. To verify its importance, we trained a network without the joint skip connections and measured its performance in CD and EMD metrics as shown in Table 2. Since the joint skip connections feed sharp information from the shape directly to deeper layers in the decoder, the reconstruction is more accurate in terms of CD loss.
Regularizer. To understand the functionality of the deformation regularizer, we have trained and tested a network without the regularization term in the objective function. Quantitatively, the model without regularization performs slightly better than the model with it, as shown in Table 2. However, the regularizer encourages more conservative and smooth deformation. For mesh reconstruction, this lowers the chance for faces to cross during deformation. As illustrated in the mesh reconstruction experiment in Figure 6, the regularizer enforces the consistency among offsets of neighboring control points, which is not guaranteed in the model without the regularizer. Point cloud density. One advantage of FFD is that the same deformation can be applied to any number of points in the control points grid, which bridges the gap between low and high resolution. To have a fair comparison with PSGN, we also trained on sparse point clouds with 1024 points, but our trained model could be directly used on deforming and reconstructing dense point clouds. Figure 4 shows the comparison between reconstructed sparse and dense point clouds, using the model trained on sparse point clouds. In general, network deforms dense point clouds similarly to sparse ones.
4.4 3D Reconstruction with Real Images
We also tested our model on real world images and visualization results are shown in Figure 5
, including a failure case. Since we trained our model on synthesized images with single-colored backgrounds, we segmented real images as the input into network as done in PSGN. Our network successfully infers the 3D shape in some cases, but fails on some others. One solution could be domain adaptation and transfer learning, to bridge the gap between rendered and real world images, which we leave to future work.
4.5 Mesh Reconstruction
Free-form deformation has been widely used on deforming mesh objects, and with FFD layer in a deep neural network, the learnable manipulation and reconstruction on mesh become possible. As analyzed in section 4.3, the regularized model refrains from drastic changes in a local patch, which gives out mesh reconstruction with plausible quality. Figure 6 shows an example of mesh reconstruction on a chair.
5 Conclusions and Outlook
This paper examines the structured prediction problem of regressing 3D point clouds based on image input to solve 3D Reconstruction with a single image. We note that existing methods with volumetric representations fall short on computational efficiency. This paper leverages the Point cloud-based representation coupled with a 3D generalization of the Spatial Transformer using Free-Form Deformation to achieve state of the art results on reconstruction with ShapeNet. We introduce Free Form Deformation as a differentiable layer to enable 3D data manipulation. This can have wider implications beyond 3D reconstruction, such as in point-cloud processing, and learning to perform grasping on unseen objects.
6 Supplementary Material
6.1 CD vs EMD
|Train \ Test||CD||EMD|
We trained DeformNet trained with both CD and EMD as objective functions, then test both of them using CD and EMD metrics. The average distance is reported on the Table 3. In terms of the point-set based metrics, the performance of these two models are similar. Besides, from our observation on output visualization, we don’t find significant difference between these two metrics. With respect to computation efficiency, CD runs much faster with KDTree searching for nearest neighbors, so we primarily used it for the evaluations in this paper.
6.2 Additional Images
We present 18 additional qualitative examples, drawn at random from the test set: