Applications in virtual and augmented reality and robotics require rapid creation of, and access to, a large number of 3D models. Even with the increasing availability of large 3D model databases, the size and growth of such databases pale when compared to the vast size of 2D image databases. As a result, the idea of editing or deforming existing 3D models based on a reference image or another source of input such as an RGBD scan is pursued by the research community.
Traditional approaches for editing 3D models to match a reference target rely on optimization-based pipelines which either require user interaction or rely on the existence of a database of segmented 3D model components. The development of 3D deep learning methods [17, 2, 31, 28, 10] has inspired more efficient alternative ways to handle 3D data. In fact, a multitude of approaches have been presented over the past few years for 3D shape generation using deep learning. Many of these, however, utilize voxel [33, 5, 37, 29, 24, 30, 34, 27] or point-based representations, since the representation of meshes and mesh connectivity in a neural network is still an open problem. The few recent methods which do use mesh representations make assumptions about fixed topology [7, 25], which limits the flexibility of their approach.
This paper describes 3DN, a 3D deformation network that deforms a source 3D mesh based on a target 2D image, 3D mesh, or 3D point cloud (e.g., acquired with a depth sensor). Unlike previous work, which assumes a fixed-topology mesh for all examples, we utilize the mesh structure of the source model. This means we can use any existing high-quality mesh model to generate new models. Specifically, given any source mesh and a target, our network estimates vertex displacement vectors (3D offsets) to deform the source model while maintaining its mesh connectivity. In addition, the global geometric constraints exhibited by many man-made objects are explicitly preserved during deformation to enhance the plausibility of the output model.
Our network first extracts global features from both the source and target inputs. These are input to an offset decoder to estimate per-vertex offsets. Since acquiring ground truth correspondences between the source and target is very challenging, we use unsupervised loss functions (e.g., Chamfer and Earth Mover’s distances) to compute the similarity of the deformed source model and the target. A difficulty in measuring similarity between meshes is the varying mesh densities across different models. Imagine a planar surface represented by just 4 vertices and 2 triangles as opposed to a dense set of planar triangles. Even though these meshes represent the same shape, vertex-based similarity computation may yield large errors. To overcome this problem, we adopt a point cloud intermediate representation. Specifically, we sample a set of points on both the deformed source mesh and the target model and measure the loss between the resulting point sets. This step requires a differentiable mesh sampling operator, which propagates features, e.g., offsets, from vertices to points in a differentiable manner.
We evaluate our approach on various targets including 3D shape datasets as well as real images and partial point scans. Qualitative and quantitative comparisons demonstrate that our network learns to perform higher quality mesh deformation compared to previous learning-based methods. We also show several applications, such as shape interpolation. In summary, our contributions are as follows:
We propose an end-to-end network to predict 3D deformation. By keeping the mesh topology of the source fixed and preserving properties such as symmetries, we are able to generate plausible deformed meshes.
We propose a differentiable mesh sampling operator in order to make our network architecture resilient to varying mesh densities in the source and target models.
2 Related Work
2.1 3D Mesh Deformation
3D mesh editing and deformation has received a lot of attention from the graphics community, where a multitude of interactive editing systems based on preserving local Laplacian properties or more global features have been presented. With easy access to growing 2D image repositories and RGBD scans, editing approaches that utilize a reference target have been introduced. Given source and target pairs, such methods use interactive or heavy processing pipelines to establish correspondences to drive the deformation. The recent success of deep learning has inspired alternative methods for handling 3D data. Yumer and Mitra propose a volumetric CNN that generates a deformation field based on a high-level editing intent. This method relies on the existence of model editing results based on semantic controllers. Kurenkov et al. present DeformNet, which employs a free-form deformation (FFD) module as a differentiable layer in their network. This network, however, outputs a set of points rather than a deformed mesh. Furthermore, the deformation space lacks smoothness and points move randomly. Groueix et al. present an approach to compute correspondences across deformable models such as humans. However, they use an intermediate common template representation which is hard to acquire for man-made objects. Pontes et al. and Jack et al. introduce methods to learn FFD. Yang et al. propose FoldingNet, which deforms a 2D grid into a 3D point cloud while preserving locality information. Compared to these existing methods, our approach is able to generate higher quality deformed meshes by handling source meshes with different topology and preserving details in the original mesh.
2.2 Single View 3D Reconstruction
Our work is also related to single-view 3D reconstruction methods, which have received a lot of attention from the deep learning community recently. These approaches have used various 3D representations including voxels [33, 2, 5, 37, 29, 24, 30, 34], point clouds, octrees [23, 8, 26], and primitives [38, 15]. Sun et al. present a dataset for 3D modeling from single images. However, pose ambiguity and artifacts widely occur in this dataset. More recently, Sinha et al. propose a method to generate the surface of an object using a representation based on geometry images. In a similar approach, Groueix et al. present a method to generate surfaces of 3D shapes using a set of parametric surface elements. The more recent methods of Kato et al. and Kanazawa et al. also use a differentiable renderer and per-vertex displacements as a deformation method to generate meshes from image sets. Wang et al. introduce a graph-based network to reconstruct 3D manifold shapes from input images. These recent methods, however, are limited to generating manifolds and require the 3D output to be topology invariant for all examples.
Given a source 3D mesh and a target model (represented as a 2D image or a 3D model), our goal is to deform the source mesh such that it resembles the target model as closely as possible. Our deformation model keeps the triangle topology of the source mesh fixed and only updates the vertex positions. We introduce an end-to-end 3D deformation network (3DN) to predict such per-vertex displacements of the source mesh.
We represent the source mesh as $S = (V, E)$, where $V \in \mathbb{R}^{N_V \times 3}$ holds the positions of the vertices and $E \in \mathbb{Z}^{N_E \times 3}$ is the set of triangles, encoding each triangle by the indices of its vertices. $N_V$ and $N_E$ denote the number of vertices and triangles respectively. The target model $T$ is either an $H \times W \times 3$ image or a 3D model. In case $T$ is a 3D model, we represent it as a set of 3D points $T \in \mathbb{R}^{N_T \times 3}$, where $N_T$ denotes the number of points in $T$.
As shown in Figure 2, 3DN takes the source $S$ and the target $T$ as input and outputs per-vertex displacements, i.e., offsets, $O \in \mathbb{R}^{N_V \times 3}$. The final deformed mesh is $S' = (V', E)$, where $V' = V + O$. Moreover, 3DN can be extended to produce per-point displacements when we replace the input source vertices with a point cloud sampled on the source. 3DN is composed of a source and a target encoder, which extract global features from the source and target models respectively, and an offset decoder, which uses these features to estimate the shape deformation. We next describe each of these components in detail.
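To make the deformation model concrete, the following is a minimal NumPy sketch of applying per-vertex offsets while reusing the triangle list unchanged. The function name `deform_mesh` and the toy square mesh are illustrative assumptions, not part of the described network.

```python
import numpy as np

def deform_mesh(vertices, faces, offsets):
    """Return deformed vertices V' = V + O; the face list (connectivity) is reused as is."""
    vertices = np.asarray(vertices, dtype=float)
    offsets = np.asarray(offsets, dtype=float)
    if vertices.shape != offsets.shape:
        raise ValueError("need exactly one 3D offset per vertex")
    return vertices + offsets, faces

# Toy source mesh (hypothetical): a unit square split into two triangles.
V = np.array([[0., 0., 0.], [1., 0., 0.], [1., 1., 0.], [0., 1., 0.]])
E = np.array([[0, 1, 2], [0, 2, 3]])
# Offsets that stretch the right edge along x.
O = np.array([[0., 0., 0.], [0.5, 0., 0.], [0.5, 0., 0.], [0., 0., 0.]])

V_def, E_def = deform_mesh(V, E, O)
```

Because only vertex positions change, any attribute tied to the triangle list of the source carries over to the deformed mesh.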
3.1 Shape Deformation Network (3DN)
Source and Target Encoders.
Given the source model $S$, we first uniformly sample a set of points on $S$ and use the PointNet architecture to encode $S$ into a source global feature vector. Similar to the source encoder, the target encoder extracts a target global feature vector from the target model. In case the target model is a 2D image, we use VGG to extract features. If the target is a 3D model, we sample points on $T$ and use PointNet. We concatenate the source and target global feature vectors into a single global shape feature vector and feed it into the offset decoder.
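As a rough illustration of how a point-set encoder can yield an order-independent global feature, the sketch below uses a single shared per-point layer followed by max pooling, in the spirit of PointNet. The layer sizes, random weights, and the name `global_feature` are assumptions made for this example; the actual encoders are deeper and trained.

```python
import numpy as np

def global_feature(points, W, b):
    """Shared per-point linear layer + ReLU, then max pooling over points (order independent)."""
    return np.maximum(points @ W + b, 0.0).max(axis=0)

rng = np.random.default_rng(0)
W_src, b_src = rng.normal(size=(3, 16)), np.zeros(16)   # hypothetical source encoder weights
W_tgt, b_tgt = rng.normal(size=(3, 16)), np.zeros(16)   # hypothetical target encoder weights

src_pts = rng.random((256, 3))   # points sampled on the source
tgt_pts = rng.random((128, 3))   # points sampled on a 3D target

# Concatenate both global features into one global shape feature vector.
shape_feat = np.concatenate([global_feature(src_pts, W_src, b_src),
                             global_feature(tgt_pts, W_tgt, b_tgt)])
```

Max pooling is what lets the two inputs have different numbers of points while still producing fixed-length features.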
Offset Decoder.
Given the global shape feature vector extracted by the source and target encoders, the offset decoder learns a function $F(\cdot)$ which predicts per-vertex displacements for $S$. In other words, given a vertex $v$ in $S$, the offset decoder predicts $F(v) = o_v$, updating the deformed vertex in $S'$ to be $v' = v + o_v$.
The offset decoder is easily extended to perform point cloud deformations. When we replace the input vertex locations with point locations, say a point $p$ in the point cloud sampled from $S$, the offset decoder predicts a displacement $F(p) = o_p$, and similarly, the deformed point is $p' = p + o_p$.
The offset decoder has an architecture similar to the PointNet segmentation network. However, unlike the original PointNet architecture, which concatenates the global shape feature vector with per-point features, we concatenate the original point positions to the global shape feature. We find that this better captures the distribution of vertex and point locations in the source and results in effective deformations. We study the importance of this architecture in Section 4.3. Finally, we note that our network can handle source and target models with varying numbers of vertices.
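The decoder input described above can be sketched as follows: the global shape feature is tiled, concatenated with each raw point position, and passed through a shared per-point MLP. This is a simplified illustration with made-up layer sizes and untrained random weights; `offset_decoder` is a hypothetical name.

```python
import numpy as np

def offset_decoder(points, global_feat, weights):
    """Shared per-point MLP: each point sees its own position plus the same global feature."""
    n = points.shape[0]
    tiled = np.tile(global_feat, (n, 1))              # (n, d)
    h = np.concatenate([points, tiled], axis=1)       # (n, 3 + d)
    for W, b in weights[:-1]:
        h = np.maximum(h @ W + b, 0.0)                # hidden layers with ReLU
    W, b = weights[-1]
    return h @ W + b                                  # (n, 3) predicted offsets

rng = np.random.default_rng(0)
d = 8                                                 # assumed global feature length
dims = [3 + d, 16, 3]
weights = [(rng.normal(scale=0.1, size=(i, o)), np.zeros(o))
           for i, o in zip(dims[:-1], dims[1:])]
g = rng.normal(size=d)

pts_a = rng.normal(size=(5, 3))    # e.g. mesh vertices
pts_b = rng.normal(size=(12, 3))   # e.g. points sampled on the mesh
offsets_a = offset_decoder(pts_a, g, weights)
offsets_b = offset_decoder(pts_b, g, weights)
```

Since the same weights are applied to every point independently, the decoder accepts any number of vertices or sampled points.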
3.2 Learning Shape Deformations
Given a deformed mesh $S'$ produced by 3DN and the 3D mesh corresponding to the target model $T = (V_T, E_T)$, where $V_T \in \mathbb{R}^{N_{V_T} \times 3}$ and $E_T \in \mathbb{Z}^{N_{E_T} \times 3}$, the remaining task is to design a loss function that measures the similarity between $S'$ and $T$. Since it is not trivial to establish ground truth correspondences between $S'$ and $T$, our method instead utilizes the Chamfer and Earth Mover’s losses introduced by Fan et al. In order to make these losses robust to different meshing densities across source and target models, we operate on sets of points uniformly sampled on $S'$ and $T$ by introducing the differentiable mesh sampling operator (DMSO). DMSO is seamlessly integrated in 3DN and bridges the gap between handling meshes and loss computation with point sets.
Differentiable Mesh Sampling Operator.
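As illustrated in Figure 3, DMSO is used to sample a uniform set of points from a 3D mesh. Suppose a point $p$ is sampled on the face $e = (v_1, v_2, v_3)$ enclosed by the vertices $v_1, v_2, v_3$. The position of $p$ is then

$$p = w_1 v_1 + w_2 v_2 + w_3 v_3,$$

where $w_1, w_2, w_3$, with $w_1 + w_2 + w_3 = 1$, are the barycentric coordinates of $p$. Given any typical feature for the original vertices, the per-vertex offsets in our case, $o_{v_1}, o_{v_2}, o_{v_3}$, the offset of $p$ is

$$o_p = w_1 o_{v_1} + w_2 o_{v_2} + w_3 o_{v_3}.$$

To perform back-propagation, the gradient for each original per-vertex offset $o_{v_i}$ is calculated simply by $g_{o_{v_i}} = w_i \, g_{o_p}$, where $g$ denotes the gradient.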
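The sampling operator can be sketched in NumPy as below: faces are drawn in proportion to their area, barycentric weights are drawn uniformly inside each triangle, vertex features are blended with those weights, and the backward pass scatters each point gradient to the three supporting vertices scaled by the same weights. The area-weighted face selection and the square-root trick for uniform barycentric weights are standard choices assumed here, and all function names are hypothetical.

```python
import numpy as np

def sample_mesh(vertices, faces, n_points, rng):
    """Area-weighted sampling; returns sampled face indices and barycentric weights."""
    tri = vertices[faces]                                     # (F, 3, 3)
    areas = 0.5 * np.linalg.norm(
        np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1)
    face_idx = rng.choice(len(faces), size=n_points, p=areas / areas.sum())
    r1, r2 = rng.random(n_points), rng.random(n_points)
    s = np.sqrt(r1)
    w = np.stack([1.0 - s, s * (1.0 - r2), s * r2], axis=1)   # each row sums to 1
    return face_idx, w

def interpolate(features, faces, face_idx, w):
    """Forward pass: per-point feature = barycentric blend of the face's vertex features."""
    return np.einsum("nk,nkd->nd", w, features[faces[face_idx]])

def backward(grad_points, faces, face_idx, w, n_vertices):
    """Backward pass: vertex i of a face receives w_i times the gradient of each sampled point."""
    grad_vertices = np.zeros((n_vertices, grad_points.shape[1]))
    contrib = w[:, :, None] * grad_points[:, None, :]         # (n, 3, d)
    np.add.at(grad_vertices, faces[face_idx], contrib)        # accumulate repeated vertices
    return grad_vertices

rng = np.random.default_rng(0)
V = np.array([[0., 0., 0.], [1., 0., 0.], [1., 1., 0.], [0., 1., 0.]])
E = np.array([[0, 1, 2], [0, 2, 3]])
O = rng.normal(size=V.shape)                  # stand-in per-vertex offsets

face_idx, w = sample_mesh(V, E, 100, rng)
P = interpolate(V, E, face_idx, w)            # sampled point positions
O_p = interpolate(O, E, face_idx, w)          # offsets carried to the points
grad_O = backward(np.ones_like(O_p), E, face_idx, w, len(V))
```

Because the blend is linear, deforming the vertices and then sampling gives the same points as sampling first and adding the interpolated offsets, which is what allows the same weights to be reused for positions and offsets.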
We train 3DN using a combination of different losses as we discuss next.
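Shape Loss.
Given a target model $T$, inspired by Fan et al., we use Chamfer and Earth Mover’s distances to measure the similarity between the deformed source and the target. Specifically, given the point cloud $PC$ sampled on the deformed output and $PC_T$ sampled on the target model, the Chamfer loss is defined as

$$L^{Mesh}_{CD}(PC, PC_T) = \sum_{p_1 \in PC} \min_{p_2 \in PC_T} \| p_1 - p_2 \|_2^2 + \sum_{p_2 \in PC_T} \min_{p_1 \in PC} \| p_1 - p_2 \|_2^2,$$

and the Earth Mover’s loss is defined as

$$L^{Mesh}_{EMD}(PC, PC_T) = \min_{\phi : PC \to PC_T} \sum_{p \in PC} \| p - \phi(p) \|_2,$$

where $\phi : PC \to PC_T$ is a bijection.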
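For intuition, the two point-set losses can be computed on tiny examples as follows. This sketch assumes the squared-distance form of the Chamfer loss and finds the Earth Mover's bijection by exhaustive search, which is only feasible for a handful of points; practical implementations rely on approximate solvers. Function names are hypothetical.

```python
import numpy as np
from itertools import permutations

def chamfer(a, b):
    """Symmetric Chamfer loss using squared Euclidean nearest-neighbour distances."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)    # (|a|, |b|)
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()

def emd_bruteforce(a, b):
    """Exact Earth Mover's loss over all bijections; only viable for tiny equal-size sets."""
    if len(a) != len(b):
        raise ValueError("a bijection needs equal-size point sets")
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    rows = np.arange(len(a))
    return min(d[rows, list(perm)].sum() for perm in permutations(range(len(b))))

a = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]])
b = a[[2, 0, 1]] + np.array([0., 0., 0.1])    # same shape, reordered and shifted in z

cd = chamfer(a, b)
emd = emd_bruteforce(a, b)
```

Neither loss depends on the order in which points are stored, which is why no ground truth correspondences are needed.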
We compute these distances between point sets sampled on both the deformed source (using the DMSO) and the target model, which makes the losses robust to different mesh densities. In practice, for each source-target model pair, we also pass a point cloud sampled on $S$ together with $T$ through the offset decoder in a second pass to help the network cope with sparse meshes. Specifically, given a point set sampled on $S$, we predict per-point offsets and compute the above Chamfer and Earth Mover’s losses between the resulting deformed point cloud and $PC_T$. We denote these two losses as $L^{Points}_{CD}$ and $L^{Points}_{EMD}$. During testing, this second pass is not necessary and we only predict per-vertex offsets for $S$.
We note that we train our model with synthetic data where we always have access to 3D models. Thus, even if the target is a 2D image, we use the corresponding 3D model to compute the point cloud shape loss. During testing, however, we do not need access to any 3D target models, since the global shape features required for offset prediction are extracted from the 2D image only.
Symmetry Loss.
Many man-made models exhibit global reflection symmetry and our goal is to preserve this during deformation. However, the mesh topology itself is not guaranteed to be symmetric, i.e., a symmetric chair does not always have symmetric vertices. Therefore, we propose to preserve shape symmetry by sampling a point cloud, $M(PC)$, on the mirrored deformed output and measuring the point cloud shape loss with this mirrored point cloud as

$$L_{sym}(M(PC), PC_T) = L_{CD}(M(PC), PC_T) + L_{EMD}(M(PC), PC_T).$$
We note that we assume the reflection symmetry plane of a source model to be known. In our experiments, we use 3D models from ShapeNet  which are already aligned such that the reflection plane coincides with the plane.
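The mirroring step can be illustrated with a small sketch that reflects a point set across a coordinate plane and compares the result with the target. For brevity only a Chamfer term is used as the shape loss here, and the mirrored axis is left as a parameter because the choice depends on how the models are aligned; both simplifications are assumptions of this example.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer loss using squared Euclidean nearest-neighbour distances."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()

def mirror(points, axis):
    """Reflect points across the coordinate plane orthogonal to `axis` (0=x, 1=y, 2=z)."""
    flipped = points.copy()
    flipped[:, axis] *= -1.0
    return flipped

def symmetry_loss(deformed_pts, target_pts, axis):
    """Shape loss between the mirrored deformed points and the target points."""
    return chamfer(mirror(deformed_pts, axis), target_pts)

target = np.array([[1., 0., 0.], [-1., 0., 0.]])       # symmetric about x = 0
symmetric = target.copy()
lopsided = np.array([[1., 0., 0.], [-0.5, 0., 0.]])

loss_sym = symmetry_loss(symmetric, target, axis=0)
loss_lop = symmetry_loss(lopsided, target, axis=0)
```

A deformed shape that is itself symmetric incurs no extra penalty from mirroring, while a lopsided one does.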
Mesh Laplacian Loss.
To preserve the local geometric details in the source mesh and enforce smooth deformation across the mesh surface, we desire the Laplacian coordinates of the deformed mesh to be the same as those of the original source mesh. We define this loss as

$$L_{lap} = \sum_i \| Lap(S)_i - Lap(S')_i \|_2,$$

where $Lap$ is the mesh Laplacian operator, and $S$ and $S'$ are the original and deformed meshes respectively.
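A minimal sketch of such a loss is given below, assuming uniform (umbrella) Laplacian coordinates, i.e., each vertex minus the mean of its edge-connected neighbours. The text does not state which Laplacian weighting is used, so the uniform weights, the dense adjacency matrix, and the function names are all assumptions for illustration.

```python
import numpy as np

def laplacian_coords(vertices, faces):
    """Uniform Laplacian coordinates: vertex position minus the mean of its neighbours."""
    n = len(vertices)
    adj = np.zeros((n, n))
    for a, b, c in faces:
        adj[a, b] = adj[b, a] = 1.0
        adj[b, c] = adj[c, b] = 1.0
        adj[a, c] = adj[c, a] = 1.0
    deg = adj.sum(axis=1, keepdims=True)
    return vertices - (adj @ vertices) / np.maximum(deg, 1.0)

def laplacian_loss(v_src, v_def, faces):
    """Sum over vertices of the L2 distance between source and deformed Laplacian coordinates."""
    diff = laplacian_coords(v_src, faces) - laplacian_coords(v_def, faces)
    return np.linalg.norm(diff, axis=1).sum()

V = np.array([[0., 0., 0.], [1., 0., 0.], [1., 1., 0.], [0., 1., 0.]])
E = np.array([[0, 1, 2], [0, 2, 3]])

translated = V + np.array([1., 2., 3.])      # rigid shift: local detail unchanged
spiked = V.copy()
spiked[0, 2] = 1.0                           # one vertex pulled out of the plane

loss_translated = laplacian_loss(V, translated, E)
loss_spiked = laplacian_loss(V, spiked, E)
```

The loss ignores global translation but reacts when a single vertex moves relative to its neighbours, which is the behaviour wanted for preserving local detail.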
Local Permutation Invariant Loss.
Most traditional deformation methods (such as FFD) are prone to self-intersections that can occur during deformation (see Figure 4). To prevent such self-intersections, we present a novel local permutation invariant loss. Specifically, given a point $p$ and a neighboring point at a distance $\delta$ to $p$, we would like to preserve the distance between these two neighboring points after deformation as well. Thus, we define

$$L_{LPI} = -\min\left(F(V + \delta) - F(V), \mathbf{0}\right),$$

where $\delta$ is a vector with a small magnitude and $\mathbf{0} = (0, 0, 0)$. In our experiments we define $\delta \in \{(\epsilon, 0, 0), (0, \epsilon, 0), (0, 0, \epsilon)\}$, where $\epsilon$ is a small scalar. The intuition behind this is to preserve the local ordering of points in the source. We observe that the local permutation invariant loss helps to achieve smooth deformation across 3D space.
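One way to realise a penalty of this kind is sketched below, under the assumption that the loss is a hinge on the change in predicted offset when a point is nudged by a small step along each coordinate axis: decreases are penalised, increases are not. The hinge form, the averaging over points, the step size, and the toy offset functions are all assumptions of this example rather than a statement of the exact loss used.

```python
import numpy as np

def lpi_loss(offset_fn, points, eps):
    """Hinge penalty on offsets that decrease when a point is nudged along +x, +y or +z."""
    base = offset_fn(points)
    total = 0.0
    for axis in range(3):
        delta = np.zeros(3)
        delta[axis] = eps
        diff = offset_fn(points + delta) - base
        total += -np.minimum(diff, 0.0).sum()     # only negative changes contribute
    return total / len(points)

def stretch(p):
    """Toy offset field that grows with position: neighbours keep their order."""
    return 0.5 * p

def fold(p):
    """Toy offset field that maps p to -p after deformation: neighbours swap order."""
    return -2.0 * p

pts = np.array([[0., 0., 0.], [1., 2., 3.]])
loss_stretch = lpi_loss(stretch, pts, eps=0.1)    # eps chosen only for this example
loss_fold = lpi_loss(fold, pts, eps=0.1)
```

The stretching field leaves the penalty at zero, whereas the folding field, which reverses the order of neighbouring points, is penalised.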
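Given all the losses defined above, we train 3DN with a combined loss of

$$L = \omega_{L_1} L^{Mesh}_{CD} + \omega_{L_2} L^{Mesh}_{EMD} + \omega_{L_3} L^{Points}_{CD} + \omega_{L_4} L^{Points}_{EMD} + \omega_{L_5} L_{sym} + \omega_{L_6} L_{lap} + \omega_{L_7} L_{LPI},$$

where $\omega_{L_1}, \ldots, \omega_{L_7}$ denote the relative weighting of the losses.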
[Figure panels: (a) Source Template, (b) Target Mesh, (c) Target Point Cloud, (d) Poisson, (e) FFD, (f) AtlasNet, (g) Ours]
In this section, we perform qualitative and quantitative comparisons on shape reconstruction from 3D target models (Section 4.1) as well as single-view reconstruction (Section 4.2). We also conduct ablation studies of our method to demonstrate the effectiveness of the offset decoder architecture and the different loss functions employed. Finally, we provide several applications to demonstrate the flexibility of our method. More qualitative results and implementation details can be found in the supplementary material.
In our experiments, we use the ShapeNet Core dataset, which includes 13 shape categories and an official training/testing split. We use the same template set of models as in previous work for potential source meshes. There are 30 shapes for each category in this template set. When training the 2D image-based target model, we use the rendered views provided by Choy et al. We note that we train a single network across all categories.
In order to sample source and target model pairs for 3DN, we train a PointNet-based auto-encoder to learn an embedding of the 3D shapes. Specifically, we represent each 3D shape as a uniformly sampled set of points. The encoder encodes the points as a feature vector and the decoder predicts the point positions from this feature vector (please refer to the supplementary material for details). Given the embedding composed of the features extracted by the encoder, for each target model candidate, we choose the nearest neighbor in this embedding as the source model. Source models are chosen from the aforementioned template set. No class label information is required to learn the embedding; however, the nearest neighbors are queried within the same category. When given a target 2D image for testing, if no desired source model is given, we use the point set generation network, PSGN, to generate an initial point cloud, and use its nearest neighbor in our embedding as the source model.
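The source selection step amounts to a nearest-neighbour query restricted to one category. A small sketch, assuming the embeddings are already available as NumPy arrays and using Euclidean distance (the distance measure and the name `pick_source` are assumptions):

```python
import numpy as np

def pick_source(target_emb, template_embs, template_cats, target_cat):
    """Index of the closest template embedding among templates of the same category."""
    candidates = np.flatnonzero(template_cats == target_cat)
    if candidates.size == 0:
        raise ValueError("no template available in this category")
    dists = np.linalg.norm(template_embs[candidates] - target_emb, axis=1)
    return int(candidates[np.argmin(dists)])

# Toy 2D embeddings for three templates (hypothetical values).
template_embs = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, 0.1]])
template_cats = np.array(["chair", "chair", "table"])
target_emb = np.array([0.2, 0.2])

chair_source = pick_source(target_emb, template_embs, template_cats, "chair")
table_source = pick_source(target_emb, template_embs, template_cats, "table")
```

In this toy case the globally closest template is the table, but restricting the query to chairs returns the closest chair instead.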
Given a source and target model pair $(S, T)$, we utilize three metrics in our quantitative evaluations to compare the deformation output $S'$ and the target $T$: 1) Chamfer Distance (CD) between the point clouds sampled on $S'$ and $T$, 2) Earth Mover’s Distance (EMD) between the point clouds sampled on $S'$ and $T$, 3) Intersection over Union (IoU) between the solid voxelizations of $S'$ and $T$. We normalize the outputs of our method and previous work into a unit cube before computing these metrics. We also evaluate the visual plausibility of our results by providing a large set of qualitative examples.
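Two of the evaluation ingredients are easy to illustrate: scaling a shape into a unit cube and computing IoU between occupancy grids. The sketch below assumes a uniform scale that maps the longest bounding-box side to 1 and takes boolean voxel grids as given; how the solid voxelization is produced is outside its scope, and the function names are hypothetical.

```python
import numpy as np

def normalize_to_unit_cube(points):
    """Translate and uniformly scale points so their bounding box fits inside [0, 1]^3."""
    lo = points.min(axis=0)
    extent = (points.max(axis=0) - lo).max()
    return (points - lo) / extent if extent > 0 else points - lo

def voxel_iou(occ_a, occ_b):
    """Intersection over union of two boolean occupancy grids of equal shape."""
    inter = np.logical_and(occ_a, occ_b).sum()
    union = np.logical_or(occ_a, occ_b).sum()
    return inter / union if union > 0 else 1.0

pts = np.array([[2., 2., 2.], [4., 3., 2.]])
unit_pts = normalize_to_unit_cube(pts)

occ_a = np.zeros((4, 4, 4), dtype=bool)
occ_b = np.zeros((4, 4, 4), dtype=bool)
occ_a[:2] = True        # slabs 0 and 1 occupied
occ_b[1:3] = True       # slabs 1 and 2 occupied
iou = voxel_iou(occ_a, occ_b)
```

Normalizing first keeps distance-based metrics comparable across methods whose outputs live at different scales.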
We compare our approach with state-of-the-art reconstruction methods. Specifically, we compare to three categories of methods: 1) learning-based surface generation, 2) learning-based deformation prediction, and 3) traditional surface reconstruction methods. We would like to note that we are solving a fundamentally different problem than surface generation methods. Even though having a source mesh to start with might seem advantageous, our problem is not easier, since our goal is not only to generate a mesh similar to the target but also to preserve certain properties of the source. Furthermore, our source meshes are obtained from a fixed set of templates which contains only 30 models per category.
4.1 Shape Reconstruction from Point Cloud
For this experiment, we define each 3D model in the testing split as a target and identify a source model in the testing split based on the autoencoder embedding described above. 3DN computes per-vertex displacements to deform the source and keeps the source mesh topology fixed. We compare the quality of this mesh against alternative meshing techniques. Specifically, given a set of points sampled on the desired target model, we reconstruct a 3D mesh using Poisson surface reconstruction. As shown in Figure 5, this comparison demonstrates that even with a ground truth set of points, generating a mesh that preserves sharp features is not trivial. Instead, our method utilizes the source mesh connectivity to output a plausible mesh. Furthermore, we apply the learning-based surface generation technique of AtlasNet to the uniformly sampled points on the target model. Thus, we expect AtlasNet only to perform surface generation without any deformation. We also compare to the method of Jack et al. (FFD), which introduces a learning-based method to apply free-form deformation to a given template model to match an input image. This network consists of a module which predicts FFD parameters based on the features extracted from the input image. We retrain this module such that it uses the features extracted from the points sampled on the 3D target model. As shown in Figure 5, the deformed meshes generated by our method are of higher quality than those of the previous methods. We also report quantitative numbers in Table 1. While AtlasNet achieves lower error based on Chamfer Distance, we observe certain artifacts such as holes and disconnected surfaces in its results. We also observe that our deformation results are smoother than those of FFD.
4.2 Single-view Reconstruction
We also compare our method to recent state-of-the-art single-view image-based reconstruction methods including Pixel2Mesh, AtlasNet, and FFD. Specifically, we choose a target rendered image from the testing split and input it to the previous methods. For our method, in addition to this target image, we also provide a source model selected from the template set. We note that the scope of our work is not single-view reconstruction, thus the comparison with Pixel2Mesh and AtlasNet is not entirely fair. However, both quantitative (see Table 2) and qualitative (Figure 6) results still provide useful insights. Though the rendered outputs of AtlasNet and Pixel2Mesh in Figure 6 are visually plausible, self-intersections and disconnected surfaces often exist in their results. Figure 7 illustrates this by rendering the output meshes in wireframe mode. Furthermore, as shown in Figure 7, while surface generation methods struggle to capture shape details such as chair handles and car wheels, our method preserves these details that reside in the source mesh.
Evaluation on real images.
We further evaluate our method on real product images that can be found online. For each input image, we select a source model as described before and provide the deformation result. Even though our method has been trained only on synthetic images, we observe that it generalizes to real images, as seen in Figure 8. AtlasNet and Pixel2Mesh fail in most cases, while our method is able to generate plausible results by taking advantage of the source meshes.
4.3 Ablation Study
We study the importance of the different losses and the offset decoder architecture on the ShapeNet chair category. We compare our final model to variants including 1) 3DN without the symmetry loss, 2) 3DN without the mesh Laplacian loss, 3) 3DN without the local permutation invariant loss, and 4) fusing global features with midlayer features instead of the original point positions (see the supplementary material for details).
We provide quantitative results in Table 3. The symmetry loss helps the deformation to produce plausible symmetric shapes. The local permutation invariant and Laplacian losses help to obtain smoothness in the deformation field across 3D space and along the mesh surface. Midlayer fusion, however, makes it hard for the network to converge to a valid deformation space.
Random Pair Deformation.
In Figure 9 we show deformation results for randomly selected source and target model pairs. While the first column of each row is the source mesh, the first row of each column is the target. Each grid cell shows deformation results for the corresponding source-target pair.
Figure 10 shows shape interpolation results. Each row shows interpolated shapes generated from the two targets and the source mesh. Each intermediate shape is generated using a weighted sum of the global feature representations of the target shapes. Notice how the interpolated shapes gradually deform from the first to the second target.
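The weighted sum of target features can be sketched as a simple linear blend; each blended feature would then be fed to the offset decoder together with the source feature. Evenly spaced weights and the name `interpolate_features` are assumptions of this example.

```python
import numpy as np

def interpolate_features(feat_a, feat_b, n_steps):
    """Blend two target feature vectors with weights (1 - t, t), t evenly spaced in [0, 1]."""
    return [(1.0 - t) * feat_a + t * feat_b for t in np.linspace(0.0, 1.0, n_steps)]

feat_a = np.array([1.0, 0.0, 2.0])    # stand-in global feature of the first target
feat_b = np.array([0.0, 4.0, 2.0])    # stand-in global feature of the second target
blends = interpolate_features(feat_a, feat_b, 5)
```

The first and last blends reproduce the two targets' features, so the corresponding deformations match the plain source-to-target results.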
We test our model trained in Section 4.1 on targets in the form of partial scans produced from RGBD data. We provide results in Figure 11 with different selections of source models. We note that AtlasNet fails on such partial scan input.
We have presented 3DN, an end-to-end network architecture for mesh deformation. Given a source mesh and a target which can be in the form of a 2D image, 3D mesh, or 3D point cloud, 3DN deforms the source by inferring per-vertex displacements while keeping the source mesh connectivity fixed. We compare our method with recent learning-based surface generation and deformation networks and show superior results. Our method is not without limitations, however. Certain deformations require changing the source mesh topology, e.g., when deforming a chair without handles to a chair with handles. If large holes exist in either the source or target models, Chamfer and Earth Mover’s distances are challenging to compute, since it is possible to generate many wrong point correspondences.
In addition to addressing the above limitations, our future work includes extending our method to predict mesh texture by taking advantage of a differentiable renderer.
- [1] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. Shapenet: An information-rich 3d model repository. arxiv, 2015.
- [2] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In ECCV, 2016.
- [3] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3d object reconstruction from a single image. In CVPR, 2017.
- [4] R. Gal, O. Sorkine, N. J. Mitra, and D. Cohen-Or. iwires: An analyze-and-edit approach to shape manipulation. ACM Trans. on Graph., 28(3), 2009.
- [5] R. Girdhar, D. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and generative vector representation for objects. In ECCV, 2016.
- [6] T. Groueix, M. Fisher, V. G. Kim, B. Russell, and M. Aubry. 3d-coded: 3d correspondences by deep deformation. In ECCV, 2018.
- [7] T. Groueix, M. Fisher, V. G. Kim, B. Russell, and M. Aubry. AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation. In CVPR, 2018.
- [8] C. Häne, S. Tulsiani, and J. Malik. Hierarchical surface prediction for 3d object reconstruction. In 3DV, 2017.
- [9] Q. Huang, H. Wang, and V. Koltun. Single-view reconstruction via joint analysis of image and shape collections. ACM Trans. Graph., 2015.
- [10] Q. Huang, W. Wang, and U. Neumann. Recurrent slice networks for 3d segmentation on point clouds. arXiv preprint arXiv:1802.04402, 2018.
- [11] D. Jack, J. K. Pontes, S. Sridharan, C. Fookes, S. Shirazi, F. Maire, and A. Eriksson. Learning free-form deformations for 3d object reconstruction. In ACCV, 2018.
- [12] A. Kanazawa, S. Kovalsky, R. Basri, and D. W. Jacobs. Learning 3d deformation of animals from 2d images. In Eurographics, 2016.
- [13] H. Kato, Y. Ushiku, and T. Harada. Neural 3d mesh renderer. In CVPR, 2018.
- [14] A. Kurenkov, J. Ji, A. Garg, V. Mehta, J. Gwak, C. Choy, and S. Savarese. Deformnet: Free-form deformation network for 3d shape reconstruction from a single image. arXiv preprint arXiv:1708.04672, 2017.
- [15] C. Niu, J. Li, and K. Xu. Im2struct: Recovering 3d shape structure from a single rgb image. In CVPR, 2018.
- [16] J. K. Pontes, C. Kong, S. Sridharan, S. Lucey, A. Eriksson, and C. Fookes. Image2mesh: A learning framework for single image 3d reconstruction. In ACCV, 2017.
- [17] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.
- [18] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [19] A. Sinha, A. Unmesh, Q. Huang, and K. Ramani. Surfnet: Generating 3d shape surfaces using deep residual networks. In CVPR, 2018.
- [20] O. Sorkine, D. Cohen-Or, Y. Lipman, M. Alexa, C. Rössl, and H.-P. Seidel. Laplacian surface editing. In Eurographics, 2004.
- [21] X. Sun, J. Wu, X. Zhang, Z. Zhang, C. Zhang, T. Xue, J. B. Tenenbaum, and W. T. Freeman. Pix3d: Dataset and methods for single-image 3d shape modeling. In CVPR, 2018.
- [22] M. Sung, V. G. Kim, R. Angst, and L. Guibas. Data-driven structural priors for shape completion. ACM Transactions on Graphics (Proc. of SIGGRAPH Asia), 2015.
- [23] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In ICCV, 2017.
- [24] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.
- [25] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. arXiv preprint arXiv:1804.01654, 2018.
- [26] P.-S. Wang, C.-Y. Sun, Y. Liu, and X. Tong. Adaptive o-cnn: A patch-based deep representation of 3d shapes. arXiv preprint arXiv:1809.07917, 2018.
- [27] W. Wang, Q. Huang, S. You, C. Yang, and U. Neumann. Shape inpainting using 3d generative adversarial network and recurrent convolutional networks. In ICCV, 2017.
- [28] W. Wang, R. Yu, Q. Huang, and U. Neumann. Sgpn: Similarity group proposal network for 3d point cloud instance segmentation. In CVPR, 2018.
- [29] J. Wu, Y. Wang, T. Xue, X. Sun, W. T. Freeman, and J. B. Tenenbaum. MarrNet: 3D Shape Reconstruction via 2.5D Sketches. In NIPS, 2017.
- [30] J. Wu, C. Zhang, X. Zhang, Z. Zhang, W. T. Freeman, and J. B. Tenenbaum. Learning shape priors for single-view 3d completion and reconstruction. In NIPS, 2018.
- [31] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR, pages 1912–1920, 2015.
- [32] K. Xu, H. Zheng, H. Zhang, D. Cohen-Or, L. Liu, and Y. Xiong. Photo-inspired model-driven 3d object modeling. ACM Trans. Graph., 30(4):80:1–80:10, 2011.
- [33] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In NIPS, 2016.
- [34] G. Yang, Y. Cui, S. Belongie, and B. Hariharan. Learning single-view 3d reconstruction with limited pose supervision. In ECCV, 2018.
- [35] Y. Yang, C. Feng, Y. Shen, and D. Tian. Foldingnet: Point cloud auto-encoder via deep grid deformation. In CVPR, 2018.
- [36] M. E. Yumer and N. J. Mitra. Learning semantic deformation flows with 3d convolutional networks. In ECCV, 2016.
- [37] R. Zhu, H. Kiani Galoogahi, C. Wang, and S. Lucey. Rethinking reprojection: Closing the loop for pose-aware shape reconstruction from a single image. In ICCV, 2017.
- [38] C. Zou, E. Yumer, J. Yang, D. Ceylan, and D. Hoiem. 3d-prnn: Generating shape primitives with recurrent neural networks. In ICCV, 2017.