Code for paper "Learning Free-Form Deformations for 3D Object Reconstruction"
Representing 3D shape in deep learning frameworks in an accurate, efficient and compact manner still remains an open challenge. Most existing work addresses this issue by employing voxel-based representations. While these approaches benefit greatly from advances in computer vision by generalizing 2D convolutions to the 3D setting, they also have several considerable drawbacks. The computational complexity of voxel-encodings grows cubically with the resolution thus limiting such representations to low-resolution 3D reconstruction. In an attempt to solve this problem, point cloud representations have been proposed. Although point clouds are more efficient than voxel representations as they only cover surfaces rather than volumes, they do not encode detailed geometric information about relationships between points. In this paper we propose a method to learn free-form deformations (FFD) for the task of 3D reconstruction from a single image. By learning to deform points sampled from a high-quality mesh, our trained model can be used to produce arbitrarily dense point clouds or meshes with fine-grained geometry. We evaluate our proposed framework on both synthetic and real-world data and achieve state-of-the-art results on point-cloud and volumetric metrics. Additionally, we qualitatively demonstrate its applicability to label transferring for 3D semantic segmentation.
Imagine one wants to interact with objects from the real world, say a chair, but in an augmented reality (AR) environment. The 3D reconstruction from the seen images should appear as realistic as possible so that one may not even perceive the chair as being virtual. The future of highly immersive AR and virtual reality (VR) applications highly depends on the representation and reconstruction of high-quality 3D models. This is obviously challenging and the computer vision and graphics communities have been working hard on such problems [1, 2, 3].
The impact that recent developments in deep learning approaches have had on computer vision has been immense. In the 2D domain, convolutional neural networks (CNNs) have achieved state-of-the-art results in a wide range of applications [4, 5, 6]. Motivated by this, researchers have been applying the same techniques to represent and reconstruct 3D data. Most rely on volumetric shape representations so that 3D convolutions can be performed on a structured voxel grid [7, 8, 9, 10, 11, 12]. A drawback is that 3D convolutions are computationally expensive, with cost growing cubically with resolution, typically limiting 3D reconstruction to exceedingly coarse representations.
Point cloud representations have recently been investigated to make the learning more efficient [13, 14, 15, 16]. However, such representations still lack the ability to describe finely detailed structures. Applying surfaces, texture and lighting to unstructured point clouds is also challenging, especially in the case of noisy, incomplete and sparse data.
The most extensively used shape representation in computer graphics is that of polygon meshes, in particular those using triangular faces. This parameterisation has been largely unexplored in the machine learning domain for the 3D reconstruction task. This is in part a consequence of most machine learning algorithms requiring regular representations of input and output data, such as voxels and point clouds. Meshes are highly unstructured and their topological structure usually differs from one model to another, which makes their 3D reconstruction from 2D images using neural networks challenging.
In this paper, we tackle this problem by exploring the well-known free-form deformation (Ffd) technique widely used for 3D mesh modelling. Ffd allows us to deform any 3D mesh by repositioning a few predefined control points while keeping its topological aspects. We propose an approach to perform 3D mesh reconstruction from single images by simultaneously learning to select and deform template meshes. Our method uses a lightweight CNN to infer the low-dimensional Ffd parameters for multiple templates, and it learns to apply large deformations to topologically different templates to produce inferred meshes with similar surfaces. We extensively demonstrate that relatively small CNNs can learn these deformations well, and achieve compelling mesh reconstructions with finer geometry than standard voxel- and point-cloud-based methods. An overview of the proposed method is illustrated in Figure 1. Furthermore, we visually demonstrate that our proposed learning framework is able to transfer semantic labels from a 3D mesh onto unseen objects.
Our contributions are summarized as follows:
We propose a novel learning framework to reconstruct 3D meshes from single images with finer geometry than voxel and point cloud based methods;
we quantitatively and qualitatively demonstrate that relatively small neural networks require minimal adaptation to learn to simultaneously select appropriate models from a number of templates and deform these templates to perform 3D mesh reconstruction;
we extensively investigate simple changes to training and loss functions to promote variation in template selection; and
we visually demonstrate our proposed method is able to transfer semantic labels onto the inferred 3D objects.
Interest in analysing 3D models has increased tremendously in recent years. This development has been driven in part by the rapid growth of readily available 3D data, the astounding progress made in the field of machine learning, as well as a substantial rise in the number of potential application areas, e.g. virtual and augmented reality.
To address 3D vision problems with deep learning techniques, a good shape representation must be found. Volumetric representations have been the most widely used for 3D learning [18, 19, 20, 7, 21, 22, 8, 9, 12, 10, 11]. Convolutions, pooling, and other techniques that have been successfully applied to the 2D domain carry over naturally to the 3D case. Volumetric autoencoders [23, 21] and generative adversarial networks (GANs) [24, 25, 26] have been introduced to learn probabilistic latent spaces of 3D objects for object completion, classification and 3D reconstruction. Volumetric representations however grow cubically in memory and computational complexity as the voxel grid resolution increases, thus limiting them to low-quality 3D reconstructions.
To overcome these limitations, octree-based neural networks have been presented [27, 28, 29, 30], where the volumetric grid is split recursively into octants. Octrees reduce the computational complexity of the 3D convolution since computation is focused only on regions where most of the object's geometric information is located. They allow for higher resolution 3D reconstructions and more efficient training; however, the outputs still lack fine-scaled geometry. A more efficient 3D representation using point clouds was recently proposed to address some of these drawbacks [13, 14, 15, 16]. A generative neural network was presented that directly outputs a set of unordered 3D points usable for 3D reconstruction from a single image and for shape completion tasks. So far, such architectures have only been demonstrated for relatively low-resolution outputs, and scaling these networks to higher resolutions remains unexplored.
3D shapes can be efficiently represented by polygon meshes, which encode both geometric (point cloud) and topological (surface connectivity) information. However, it is difficult to parametrize meshes for use within learning frameworks. A deep residual network to generate 3D meshes has been proposed; a limitation, however, is its reliance on the geometry image representation for generative modelling of 3D surfaces, so it can only manage simple (i.e. genus-0) and low-quality surfaces. Other work reconstructs 3D mesh objects from single images by jointly analysing a collection of images of different objects along with a smaller collection of existing 3D models. While that method yields impressive results, it suffers from scalability issues and is sensitive to the semantic segmentation of the image and to dense correspondences.
Ffd has also been explored for 3D mesh representation, where one can intrinsically represent an object by a set of polynomial basis functions and a small number of coefficients known as control points, used for cage-like deformation. A 3D mesh editing tool has been proposed that uses a volumetric generative network to infer per-voxel deformation flows using Ffd. That method takes a volumetric representation of a 3D mesh as input along with a high-level deformation intention label (e.g. sporty car, fighter jet, etc.) to learn the Ffd displacements to be applied to the original mesh. In [34, 35], a method for 3D mesh reconstruction from a single image was proposed based on a low-dimensional parametrization using Ffd and sparse linear combinations, given the image silhouette and class-specific landmarks. Recently, DeformNet was proposed, employing Ffd as a differentiable layer in a 3D reconstruction framework. The method builds upon two networks: a 2D CNN for 3D shape retrieval and a 3D CNN to infer the Ffd parameters to deform the 3D point cloud of the retrieved shape. In contrast, our proposed method reconstructs 3D meshes using a single lightweight CNN with no 3D convolutions involved, inferring a 3D mesh template and its deformation flow in one shot.
We focus on the problem of inferring a 3D mesh from a single image. We represent a 3D mesh by a list of vertex coordinates $V = \{v_1, \dots, v_{N_v}\}$ and a set of triangular faces $F = \{f_1, \dots, f_{N_f}\}$ defined such that $f_i = (a_i, b_i, c_i)$ indicates there is a face connecting the vertices $v_{a_i}$, $v_{b_i}$ and $v_{c_i}$.
Given a query image, the task is to infer a 3D mesh which is close by some measure to the actual mesh of the object in the image. We employ the Ffd technique to deform the 3D mesh to best fit the image.
There are a number of metrics which can be used to compare 3D meshes. We consider three: Chamfer distance and earth mover distance between point clouds, and the intersection-over-union (IoU) of their voxelized representations.
The Chamfer distance between two point clouds $A$ and $B$ is defined as

$$\lambda_c(A, B) = \sum_{a \in A} \min_{b \in B} \|a - b\|^2 + \sum_{b \in B} \min_{a \in A} \|a - b\|^2.$$
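As a concrete illustration, the Chamfer distance above can be computed with a brute-force nearest-neighbour search. This is a minimal NumPy sketch following the definition (function name and layout are illustrative, not taken from the paper's released code):

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point clouds a (N, 3) and b (M, 3),
    summing squared nearest-neighbour distances in both directions."""
    # Pairwise squared distances, shape (N, M).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()
```

For large clouds, a KD-tree query (e.g. `scipy.spatial.cKDTree`) avoids the O(NM) memory cost of the pairwise matrix.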
The earth mover distance between two point clouds of the same size is the sum of distances between a point in one cloud and a corresponding partner in the other, minimized over all possible 1-to-1 correspondences. More formally,

$$\lambda_{em}(A, B) = \min_{\phi : A \to B} \sum_{a \in A} \|a - \phi(a)\|,$$

where $\phi$ is a bijective mapping.
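For equal-sized clouds, the optimal bijection can be found exactly with the Hungarian algorithm. A small sketch assuming SciPy is available (in practice approximate solvers are used for large clouds, since the exact assignment is cubic in N):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def earth_mover_distance(a, b):
    """EMD between equal-sized point clouds a, b of shape (N, 3): the sum
    of distances under the cost-minimizing one-to-one correspondence."""
    assert a.shape == b.shape
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, N)
    rows, cols = linear_sum_assignment(cost)  # optimal bijection
    return cost[rows, cols].sum()
```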
Point cloud metrics evaluated on vertices of 3D meshes can give misleading results, since large planar regions will have very few vertices, and hence contribute little. Instead, we evaluate these metrics using a set of points sampled uniformly from the surface of each 3D mesh.
As the name suggests, the intersection-over-union of volumetric representations is defined by the ratio of the volume of the intersection over that of their union,

$$\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$
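On boolean occupancy grids this reduces to counting voxels; a short sketch for reference:

```python
import numpy as np

def voxel_iou(a, b):
    """Intersection-over-union of two boolean occupancy grids of equal shape."""
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()
```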
We deform a 3D object by freely manipulating some control points using the Ffd technique. Ffd creates a grid of control points whose axes are defined by the orthogonal vectors $s$, $t$ and $u$. The control points are then defined by $l$, $m$ and $n$, which divide the grid into $l+1$, $m+1$ and $n+1$ planes in the $s$, $t$ and $u$ directions, respectively. A local coordinate for each of the object's vertices is then imposed.
In this work, we deform an object through a trivariate Bernstein tensor, which is essentially a weighted sum of the control points. The deformed position of any arbitrary point is given by

$$s(u, v, w) = \sum_{i=0}^{l} \sum_{j=0}^{m} \sum_{k=0}^{n} B_{l,i}(u)\, B_{m,j}(v)\, B_{n,k}(w)\; p_{i,j,k}, \tag{4}$$

where $s$ contains the coordinates of the displaced point, $B_{l,i}$ is the Bernstein polynomial of degree $l$ which sets the influence of each control point on every model vertex, and $p_{i,j,k}$ is the $(i,j,k)$-th control point. Equation (4) is a linear function of the control points and can be written in matrix form as

$$S = B P,$$

where the rows of $S \in \mathbb{R}^{N \times 3}$ are the vertices of the 3D mesh, $B \in \mathbb{R}^{N \times M}$ is the deformation matrix, $P \in \mathbb{R}^{M \times 3}$ holds the control point coordinates, and $N$ and $M$ are the number of vertices and control points, respectively.
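Since the deformation matrix $B$ depends only on the (fixed) local coordinates of the points, it can be precomputed once per template. Below is a NumPy sketch of equation (4), assuming points are already expressed in local grid coordinates in $[0,1]^3$ and using degrees $l = m = n = 3$ by default (degrees and function names here are illustrative, not taken from the paper's released code):

```python
import numpy as np
from scipy.special import comb

def bernstein_matrix(stu, degrees=(3, 3, 3)):
    """Trivariate Bernstein basis matrix B of shape (N, M) for points stu
    (N, 3) in local FFD coordinates in [0, 1]; M = (l+1)(m+1)(n+1)."""
    def basis(t, d):
        i = np.arange(d + 1)
        # Bernstein polynomials B_{d,i}(t) = C(d, i) t^i (1 - t)^(d - i).
        return comb(d, i) * t[:, None] ** i * (1.0 - t[:, None]) ** (d - i)
    bu, bv, bw = (basis(stu[:, k], degrees[k]) for k in range(3))
    # Outer product over the three axes, flattened with k varying fastest.
    return (bu[:, :, None, None] * bv[:, None, :, None]
            * bw[:, None, None, :]).reshape(len(stu), -1)

def deform(stu, control_points):
    """Deformed positions S = B P, for control points P of shape (M, 3)."""
    return bernstein_matrix(stu) @ control_points
```

With control points placed on the undeformed lattice, $p_{i,j,k} = (i/l,\, j/m,\, k/n)$, the map is the identity (linear precision of the Bernstein basis), which makes a useful sanity check.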
Our method involves applying deformations encoded with a parameter $\Delta P^{(t)}$ to $T$ different template models with $t \in \{1, \dots, T\}$. We begin by calculating the Bernstein decomposition of the face-sampled point cloud of each template model, $S^{(t)} = B^{(t)} P^{(t)}$. We use a CNN to map a rendered image to a set of shared high-level image features. These features are then mapped to a grid deformation $\Delta P^{(t)}$ for each template independently to produce a perturbed set of surface points for each template,

$$\tilde S^{(t)} = B^{(t)} \big(P^{(t)} + \Delta P^{(t)}\big).$$
We also infer a weighting value $\gamma^{(t)}$ for each template from the shared image features, and train the network using a weighted Chamfer loss,

$$\lambda_0 = \sum_{t} f\big(\gamma^{(t)}\big)\, \lambda_c\big(\tilde S^{(t)}, S\big),$$

where $f$ is some positive monotonically increasing scalar function and $S$ is the point cloud sampled from the ground-truth mesh.
In this way, the network learns to assign high weights to templates which it has learned to deform well based on the input image, while the sub-networks for each template are not highly penalized for examples which are better suited to other templates. We enforce a minimum weight by using

$$\gamma^{(t)} = (1 - T\epsilon)\, \gamma_s^{(t)} + \epsilon,$$

where $\gamma_s^{(t)}$ is the result of a softmax function with summation over the templates and $\epsilon$ is a small constant threshold.
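One plausible realization of this floored softmax (the exact blending used in the paper may differ; `eps` here is an assumed threshold value) is:

```python
import numpy as np

def template_weights(logits, eps=1e-3):
    """Softmax over per-template logits, blended with a uniform floor so
    every template keeps weight at least eps; the result still sums to 1."""
    s = np.exp(logits - logits.max())  # numerically stable softmax
    s /= s.sum()
    return (1.0 - len(logits) * eps) * s + eps
```

The floor guarantees every deformation sub-network receives some gradient signal even when the weighting sub-network strongly prefers one template.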
For inference and evaluation, we use the template with the highest weight,

$$t^* = \operatorname{argmax}_t\, \gamma^{(t)}.$$
Key advantages of the architecture are as follows:
no 3D convolutions are involved, meaning the network scales well with increased resolution;
no discretization occurs, allowing higher precision than voxel-based methods;
the output can be used to generate an arbitrarily dense point cloud – not necessarily the same density as that used during training; and
a mesh can be inferred by applying the deformation to the Bernstein decomposition of the vertices while maintaining the same face connections.
The main drawbacks are that:
the network size scales linearly with the number of templates considered; and
there is at this time no mechanism to explicitly encourage topological or semantic similarity.
Preliminary experiments showed that training using standard optimizers with an identity weighting function resulted in a small number of templates being selected frequently. This is at least partially due to a positive feedback loop caused by the interaction between the weighting sub-network and the deformation sub-networks. If a particular template deformation sub-network performs particularly well initially, the weighting sub-network learns to assign increased weight to this template. This in turn affects the gradients which flow through that deformation sub-network, resulting in faster learning, improved performance and hence higher weight in subsequent batches. We experimented with a number of network modifications to reduce this.
One problem with the identity weighting scheme ($f(\gamma) = \gamma$) is that there is no penalty for over-confidence. A well-trained network with a slight preference for one template over all the rest will be inclined to put all weight into that template. By using an $f$ with positive curvature, we discourage the network from making overly confident inferences. We experimented with an entropy-inspired weighting for $f$.
Another approach is to penalize the lack of diversity directly by introducing an explicit entropy loss term,

$$\lambda_e = \sum_{t} \bar\gamma^{(t)} \log \bar\gamma^{(t)},$$

where $\bar\gamma^{(t)}$ is the weight value of template $t$ averaged over the batch. This encourages an even distribution over the batch but still allows confident estimates for the individual inferences. For these experiments, the network was trained with a linear combination of the weighted Chamfer loss and the entropy penalty,

$$\lambda = \lambda_0 + \kappa_e \lambda_e.$$
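The batch-averaged entropy penalty can be sketched as follows (it is the negated entropy of the averaged weights, so minimizing it spreads weight across templates; names are illustrative):

```python
import numpy as np

def entropy_penalty(batch_weights):
    """Entropy loss term: sum_t gbar_t * log(gbar_t), where gbar_t is the
    weight of template t averaged over the batch (batch_weights: (B, T)).
    Minimizing this negated entropy pushes the batch-averaged distribution
    towards uniform while individual inferences may remain peaked."""
    gbar = batch_weights.mean(axis=0)
    return float(np.sum(gbar * np.log(gbar)))
```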
While a large entropy error term encourages all templates to be assigned weight and hence all subnetworks to learn, it also forces all subnetworks to try to learn all possible deformations. This works against the idea of specialization, where each subnetwork should learn to deform its template to match query models close to that template. To alleviate this, we anneal the entropy weighting over time,

$$\kappa_e(b) = \kappa_{e0}\, c^{b},$$

where $\kappa_{e0}$ is the initial weighting, $b$ is the batch index and $c$ is some scaling factor.
In order to encourage the network to select a template requiring minimal deformation, we introduce a deformation regularization term,

$$\lambda_r = \sum_t \gamma^{(t)} \big\|\Delta P^{(t)}\big\|_2^2,$$

where $\|\cdot\|_2^2$ is the squared 2-norm of the vectorized input.
Large regularization encourages the network to select the closest matching template, but punishes subnetworks for deforming their template a lot, even if the result is a better match to the query mesh. We combine this regularization term with the standard loss in a similar way to the entropy loss term,

$$\lambda = \lambda_0 + \kappa_r(b)\, \lambda_r,$$

where $\kappa_r(b)$ is an exponentially annealed weighting with initial value $\kappa_{r0}$.
For the algorithm to result in high-quality 3D reconstructions, it is important that the vertex density of each template mesh is approximately equivalent to (or higher than) the point cloud density used during training. To ensure this is the case, we subdivide edges in the template mesh such that no edge length is greater than some threshold. Example cases where this is particularly important are illustrated in Figure 2.
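A simple longest-edge bisection achieves this. The sketch below splits faces until no edge exceeds the threshold (for clarity it does not deduplicate midpoints shared between adjacent faces, which a production implementation would):

```python
import numpy as np

def subdivide(vertices, faces, max_len):
    """Split each triangle whose longest edge exceeds max_len by inserting
    the midpoint of that edge, repeating until all edges are short enough."""
    verts = list(map(np.asarray, vertices))
    queue = [tuple(f) for f in faces]
    out = []
    while queue:
        a, b, c = queue.pop()
        tri = [(a, b, c), (b, c, a), (c, a, b)]
        # Find the longest edge of this face.
        lengths = [np.linalg.norm(verts[p] - verts[q]) for p, q, _ in tri]
        k = int(np.argmax(lengths))
        if lengths[k] <= max_len:
            out.append((a, b, c))
            continue
        p, q, r = tri[k]
        verts.append((verts[p] + verts[q]) / 2)  # midpoint vertex
        m = len(verts) - 1
        queue += [(p, m, r), (m, q, r)]  # two halves of the split face
    return np.array(verts), np.array(out)
```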
We employed a MobileNet architecture, which uses depthwise separable convolutions to build lightweight deep neural networks for mobile and embedded vision applications, without the final fully connected layers and with a reduced width multiplier. Weights were initialized from the convolutional layers of a network trained on the ImageNet dataset. To reduce dimensionality, we add a single convolution after the final MobileNet convolution layer. After flattening the result, we have one shared fully connected layer followed by a fully connected layer for each template. A summary of layers and output dimensions is given in Table 1.
We used a subset of the ShapeNet Core dataset over a number of categories, using an 80/20 train/evaluation split. All experiments were conducted using the same number of control points in each dimension for the free-form parametrizations. To balance computational cost against loss accuracy, we initially sampled all model surfaces with a fixed number of points for both labels and free-form decomposition. At each step of training, we sub-sampled a different subset of these points for use in the Chamfer loss.
All input images were the result of rendering each textured mesh from the same view: elevated above the horizontal, rotated away from front-on, and well-lit by a light above and on each side of the model. We trained a different network with 30 templates for each category; templates were selected manually to ensure good variety. Models were trained using a standard Adam optimizer with mini-batches of 32, and annealed loss weightings were decayed exponentially.
For each training regime, a different model was trained for each category. Hyper-parameters for specific training regimes are given in Table 2.
To produce meshes and subsequent voxelizations and IoU scores, template meshes had their edges subdivided to a maximum length. We voxelize on a regular grid by initially assigning any voxel containing part of the mesh as occupied, and subsequently filling in any unoccupied voxels with no free path to the outside.
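The interior-filling step can be implemented as a flood fill from the grid boundary through empty voxels. A sketch assuming SciPy's connected-component labelling (face connectivity):

```python
import numpy as np
from scipy import ndimage

def fill_interior(occupied):
    """Given a boolean surface-occupancy grid, additionally mark as occupied
    every voxel with no free path to the outside of the grid."""
    # Pad with an empty border so the outside is one connected region.
    padded = np.pad(occupied, 1)
    labels, _ = ndimage.label(~padded)          # label empty regions
    outside = labels == labels[0, 0, 0]         # component touching the border
    filled = ~outside                           # surface + enclosed interior
    return filled[1:-1, 1:-1, 1:-1]
```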
Qualitatively, we observe the network's preference for applying relatively large deformations to geometrically simple templates, and it does not shy away from merging separate features of the template models. For example, models frequently select the bi-plane template and merge the wings together to match single-wing aircraft, or warp standard 4-legged chairs and tables into structurally different objects, as shown in Figure 3.
For point cloud comparison, we compare against the works of Kurenkov et al. (DN) and Fan et al. (PSGN) for 5 categories, using the reported results for the improved PSGN model. We use the same scaling as in these papers, finding transformation parameters that map the ground-truth meshes to a minimal bounding hemisphere of fixed radius and applying this transformation to the inferred clouds. We also compare IoU values with PSGN on an additional 8 categories for voxelized outputs. Results for all 13 categories with each different training regime are given in Table 3.
All our training regimes outperform the others by a significant margin on all categories for the point-cloud metrics (Chamfer and earth mover distances). We also outperform PSGN on IoU for most categories and on average. The categories for which the method performs worst in terms of IoU – tables and chairs – typically feature large, flat surfaces and thin structures. Poor IoU scores can largely be attributed to poor width or depth inference (a difficult problem given the single view provided) and to small, consistent offsets that do not induce large Chamfer losses. An example is shown in Figure 8.
We begin our analysis by investigating the number of times each template was selected across the different training regimes, and the quality of the match of the undeformed template to the query model. Results for the sofa and table categories are given in Figure 13.
We illustrate the typical behaviour of our framework with the sofa and table categories, since these contain topologically similar models and topologically different models, respectively. In both cases, the base training regime (b) resulted in a model with template selection dominated by a small number of templates, while additional loss terms in the form of deformation regularization (r) and entropy (e) succeeded in smearing out this distribution to some extent. The behaviour of the non-linear weighting regime (w) is starkly different across the two categories, however: it reinforces template dominance for the category with fewer topological differences across the dataset, while encouraging variety for the table category.
In terms of the Chamfer loss, all training regimes produced deformed models with virtually equivalent results. The difference is apparent when inspecting the undeformed models. Unsurprisingly, penalizing large deformations via regularization yields the best undeformed templates, while the other two non-base methods selected templates slightly better than the base regime.
To further investigate the effect of template selection on the model, we trained a base model with a single template, and entropy models with varying numbers of templates, for the sofa dataset. In each case, the top templates selected by the 30-template regularized model were used. Cumulative Chamfer losses and IoU scores are shown in Figure 16.
Surprisingly, the deformation networks manage to achieve almost identical results on these metrics regardless of the number of templates available. Additional templates do improve accuracy of the undeformed model up to a point, suggesting the template selection mechanism is not fundamentally broken.
While no aspect of the training related to semantic information, applying the inferred deformations to a semantically labelled point cloud allows us to infer another semantically labelled point cloud. Some examples are shown in Figure 33. For cases where the template is semantically similar to the query object, the additional semantic information is retained in the inferred cloud. However, some templates either do not have points of all segmentation types, or have points of segmentation types that are not present in the query object. In these cases, while the inferred point cloud matches the surface relatively well, the semantic information is unreliable.
We have presented a simple framework for combining modern CNN approaches with detailed, unstructured meshes by using Ffd as a fixed-size intermediary and simultaneously learning to select and deform template point clouds based on minimally adjusted off-the-shelf image processing networks. We significantly outperform state-of-the-art methods with respect to point cloud generation, and perform at or above state-of-the-art on the volumetric IoU metric, despite our network not being optimized for it. We present various mechanisms by which the diversity of templates selected can be increased and demonstrate that these result in modest improvements.
We demonstrate that the main driver of the low metric scores is the network's ability to learn deformations tailored to specific templates, rather than the precise selection of those templates. Models with only a single template to select from achieve comparable results to those with a greater selection at their disposal. This indicates the choice of template – and hence any semantic or topological information it carries – makes little difference to the resulting point cloud, diminishing the trustworthiness of such topological or semantic information about the inferred model.
This research was supported by the ARC grants DP170100632 and FT170100072. Computational resources and services used in this work were provided by the HPC and Research Support Group, Queensland University of Technology, Brisbane, Australia.
Rubner, Y., Tomasi, C., Guibas, L.J.: The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40(2) (2000) 99–121