1 Introduction
With the increasing volume of datasets of deforming objects enabled by modern 3D acquisition technology, the demand for compact data representations and compression grows. Dimensionality reduction of mesh data has multiple applications in computer graphics and vision, including shape retrieval, generation, interpolation, and completion, among others. Recently, deep convolutional autoencoder networks were shown to be able to produce compact mesh representations
[2, 34, 29]. Dynamic real-world objects do not deform arbitrarily. While deforming, they preserve topology, and nearby points are more likely to deform similarly than distant points. Current convolutional mesh autoencoders exploit this coherence by learning the deformation properties of objects directly from data and are already suitable for mesh compression and representation learning. On the other hand, they do not explicitly reason about the deformation field in terms of local rotations and translations. We show that explicitly reasoning about the local rigidity of meshes enables higher-quality results for highly deformable objects, compared to directly regressing vertex positions.
At the other end of the spectrum, mesh manipulation techniques such as As-Rigid-As-Possible Deformation [31] and Embedded Deformation [32] only require a single mesh and enforce deformation properties, such as smoothness and local rigidity, based on a set of hand-crafted priors. These hand-crafted priors are effective and work surprisingly well, but since they do not model the real-world deformation behavior of the physical object, they often lead to unrealistic deformations and artifacts in the reconstructions.
In this paper, we propose a general-purpose mesh autoencoder with a model-based deformation layer, combining the best of both worlds, i.e., supervised learning with deformable meshes and a novel differentiable embedded deformation layer that models the deformable meshes using lower-dimensional deformation graphs with physically interpretable deformation parameters. While the core of our DEep MEsh Autoencoder (DEMEA) learns the deformation model of objects from data using the state-of-the-art convolutional mesh autoencoder (CoMA) [29], the novel embedded deformation layer decouples the parameterization of object motion from the mesh resolution and introduces local spatial coherence via vertex skinning.
DEMEA is trained on mesh datasets of moderate sizes that have recently become available [22, 4, 3, 24]. DEMEA is a general mesh autoencoding approach that can be trained for any deformable object class. We evaluate our approach on three object classes with large deformations, i.e., articulated motion (body, hand) and large non-linear deformations (cloth), and one object class with small localized deformations (face). Quantitatively, DEMEA outperforms standard convolutional mesh autoencoder architectures in terms of the vertex-to-vertex distance error. Qualitatively, we show that DEMEA produces visually higher-fidelity results due to the physically based embedded deformation layer.
We show several applications of DEMEA in computer vision and graphics. Once trained, the decoder of our autoencoders can be used for shape compression, high-quality depth-to-mesh reconstruction of human bodies and hands, and even poorly textured RGB-image-to-mesh reconstruction for deforming cloth. The low-dimensional latent space learned by our approach is meaningful and well-behaved, which we demonstrate by linearly interpolating between the latent codes of different meshes. Thus, DEMEA provides us a well-behaved, general-purpose, category-specific generative model of highly deformable objects.
2 Related Work
Mesh Manipulation and Tracking. Our embedded deformation layer is inspired by as-rigid-as-possible modeling [31] and the method of Sumner et al. [32] for mesh editing and manipulation. While these methods have been shown to be very useful for mesh manipulation in computer graphics, to the best of our knowledge, this is the first time a model-based regularizer is used in a mesh autoencoder.
Using a template for non-rigid object tracking from depth maps has been studied extensively in the model-based setting [20, 38]. Recently, Litany et al. [21] demonstrated a neural-network-based approach for the completion of human body shapes from a single depth map.
Graph Convolutions. The encoder-decoder approach to dimensionality reduction with neural networks (NNs) for images was introduced in [15]. With deeper architectures encompassing a large number of parameters, learning can be performed on large data structures, including deformable meshes. Deep convolutional neural networks (CNNs) effectively capture contextual information of the input data modalities and can be trained for various tasks. Lately, convolutions operating on regular grids have been generalized to more general topologically connected structures such as meshes and two-dimensional manifolds
[6, 27], enabling learning of correspondences between shapes, shape retrieval [25, 5, 26], and segmentation [37]. Masci et al. [25] proposed geodesic CNNs operating on Riemannian manifolds for shape description, retrieval, and correspondence estimation. Boscaini et al. [5] introduced spatial weighting functions based on simulated heat propagation and projected anisotropic convolutions. Monti et al. [26] extended graph convolutions to variable patches through Gaussian mixture model CNNs. In FeaSTNet [35], the correspondences between filter weights and graph neighborhoods with arbitrary connectivity are established dynamically from the learned features. The localized spectral interpretation of Defferrard et al. [7] is based on recursive feature learning with Chebyshev polynomials and has linear evaluation complexity.
Learning Mesh-Based 3D Autoencoders. Very recently, several mesh autoencoders with various applications have been proposed. A hierarchical variational mesh autoencoder with fully connected layers for facial geometry parameterization learns an accurate face model from small databases and accomplishes depth-to-mesh fitting tasks [2]. Tan and coworkers [33] introduced a mesh autoencoder with a rotation-invariant mesh representation as a generative model. Their network can generate new meshes by sampling in the latent space and can perform mesh interpolation. To cope with meshes of arbitrary connectivity, they use fully connected layers and do not explicitly encode neighbor relations. Tan et al. [34] train a network with graph convolutions to extract sparse localized deformation components from meshes. Their method is suitable for large-scale deformations and meshes with irregular connectivity. Gao et al. [10] transfer mesh deformations by training a generative adversarial network with a cycle consistency loss to map shapes in the latent space, while a variational mesh autoencoder encodes deformations. The Convolutional facial Mesh Autoencoder (CoMA) of Ranjan et al. [29] allows modeling and sampling stronger deformations than previous methods and supports asymmetric facial expressions.
Our DEMEA is a general-purpose mesh autoencoder that can be used for shape completion, shape interpolation, and even surface reconstruction from monocular images using shading cues.
Similar to CoMA [29], our DEMEA uses spectral graph convolutions but additionally employs the embedded deformation layer as a modelbased regularizer.
While most of these approaches show results only on a single object category, we demonstrate the usefulness of our approach through evaluations on three datasets of highly deformable objects.
We believe that it is essential to incorporate the point relationships of the mesh data into the architecture whenever the connectivity is available.
Learning 3D Reconstruction.
Several supervised methods reconstruct rigid objects in 3D. Given a depth image, the network of Sinha et al. [30] reconstructs the observed surface of non-rigid objects. In its 3D reconstruction mode, their method reconstructs rigid objects from single images. Similarly, Groueix et al. [13] reconstruct object surfaces from a point cloud or a single monocular image with an atlas parameterization.
The approaches of Kurenkov et al. [19] and Jack et al. [16] deform a predefined object-class template to match the observed object appearance in an image. Similarly, Kanazawa et al. [17] deform a template to match the object appearance but additionally support object texture. The Pixel2Mesh approach of Wang et al. [36] reconstructs an accurate mesh of an object in a segmented image. Initializing the 3D reconstruction with an ellipsoid, they gradually deform it until the appearance matches the observation. The template-based approaches [19, 16, 17], as well as Pixel2Mesh [36], produce complete 3D meshes.
Learning Monocular Non-Rigid Surface Regression. Only a few supervised learning approaches for 3D reconstruction from monocular images tackle the deformable nature of non-rigid objects. Pumarola et al. [28] and Golyanik et al. [12] train networks for deformation models with synthetic thin plate datasets. Their methods can infer non-rigid states of the observed surfaces such as paper sheets or membranes. The accuracy and robustness of both methods on real images are limited. Bednařík et al. [3] propose an encoder-decoder network for textureless surfaces relying on shading cues. They train on a real dataset and show an enhanced reconstruction accuracy on real images, but support only trained object classes. Fuentes-Jimenez et al. [9] train a network to deform an object template for depth map recovery. They achieve impressive results on real image sequences but require an accurate 3D model of every object in the scene, which restricts the method's practicality.
One of the applications of DEMEA is the recovery of textureless surfaces from RGB images. Since a depth map is, as a data modality, close to an image of a shaded surface, we train DEMEA in the depth-to-mesh mode on images instead of depth maps. As a result, we can regress surface geometry from shading cues.
3 Approach
In this section, we describe the architecture of the proposed DEMEA. We employ an expert-designed embedded deformation layer to decouple the complexity of the learned deformation field from the actual mesh resolution. The deformation is represented relative to a canonical mesh $\mathcal{M}$ with vertices $\mathcal{V}$ and edges $\mathcal{E}$. To this end, we define the encoder-decoder on a coarse deformation graph and use the embedded deformation layer to drive the deformation of the final high-resolution mesh, see Fig. 1. Our architecture is based on spectral graph convolutions that are defined on a multi-resolution graph hierarchy. In the following, we describe all components in more detail.
3.1 Mesh Hierarchy
The up- and downsampling in the convolutional mesh autoencoder is defined over a multi-resolution mesh hierarchy, similar to the CoMA [29] architecture. We compute the mesh hierarchy fully automatically based on quadric edge collapses [11], i.e., each hierarchy level is a simplified version of the input mesh. We employ a hierarchy with five resolution levels, where the finest level is the mesh itself. Given the multi-resolution graph hierarchy, we define up- and downsampling operations [29] for feature maps defined on the graph. During downsampling, we enforce the nodes of the coarser level to be a subset of the nodes of the next finer level; we thus transfer a feature map to the next coarser level by the corresponding subsampling operation. The inverse operation, i.e., feature map upsampling, is implemented based on a barycentric interpolation of close features. During edge collapse, we project each collapsed node onto the closest triangle of the coarser level. We use the barycentric coordinates of this closest point with respect to the triangle's vertices to define the interpolation weights.
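Given precomputed subset indices and barycentric weights, the two transfer operations can be sketched as follows (a minimal NumPy sketch; the function and variable names are illustrative, not from the DEMEA code):

```python
import numpy as np

def downsample(features, keep_idx):
    # The coarser level's nodes are a subset of the finer level's nodes,
    # so downsampling simply gathers the corresponding feature rows.
    return features[keep_idx]

def upsample(features, tri_idx, bary_w):
    # Each fine-level node was projected onto its closest coarse-level
    # triangle; its feature is the barycentric combination of the three
    # triangle-vertex features.
    #   tri_idx: (N_fine, 3) vertex indices into the coarse level
    #   bary_w:  (N_fine, 3) barycentric weights, each row sums to 1
    return (features[tri_idx] * bary_w[..., None]).sum(axis=1)

# Toy example: four coarse nodes carrying 2-channel features.
coarse = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
tri_idx = np.array([[0, 1, 2], [1, 2, 3]])
bary_w = np.array([[1/3, 1/3, 1/3], [0.5, 0.25, 0.25]])
fine = upsample(coarse, tri_idx, bary_w)   # shape (2, 2)
```

Since the interpolation weights depend only on the hierarchy, both operations are fixed sparse linear maps and can be precomputed once per object class.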
3.2 Embedded Deformation Layer (EDL)
Given a canonical mesh for an object category, we design a corresponding coarse embedded deformation graph, see Fig. 2.
The deformation graph is used as one of the two levels immediately below the mesh in the mesh hierarchy (depending on the resolution of the graph) of the autoencoder. As the quadric edge collapse algorithm can delete nodes of the embedded graph when computing intermediate levels of the graph hierarchy, we modify the algorithm to ensure that the nodes of the embedded graph are not removed from finer levels. The number of nodes in the deformation graph is kept significantly lower than the mesh resolution, and highly deformable regions (arms and legs in the case of bodies) are assigned relatively more nodes.
Our embedded deformation layer models a space deformation that maps the vertices $\mathcal{V}$ of the canonical template mesh $\mathcal{M}$ to a deformed version $\mathcal{M}'$. Suppose $\mathcal{G}$ is the embedded deformation graph [32] with $L$ canonical nodes $\mathbf{g}_l \in \mathbb{R}^3$ and edges $\mathcal{E}_{\mathcal{G}}$, with $L \ll |\mathcal{V}|$. The global space deformation is defined by a set of local, rigid, per-graph-node transformations. Each local rigid space transformation is defined by a tuple $T_l = (\mathbf{R}_l, \mathbf{t}_l)$, with $\mathbf{R}_l \in \mathbb{R}^{3 \times 3}$ being a rotation matrix and $\mathbf{t}_l \in \mathbb{R}^3$ being a translation vector. We enforce that $\mathbf{R}_l^{\top}\mathbf{R}_l = \mathbf{I}_3$ and $\det(\mathbf{R}_l) = 1$ by parameterizing the rotation matrices based on three Euler angles. Each $T_l$ is anchored at the canonical node position $\mathbf{g}_l$ and maps every point $\mathbf{p} \in \mathbb{R}^3$ to a new position in the following manner [32]:
$T_l(\mathbf{p}) = \mathbf{R}_l(\mathbf{p} - \mathbf{g}_l) + \mathbf{g}_l + \mathbf{t}_l \, . \qquad (1)$
To obtain the final global space deformation $G$, the local per-node transformations are linearly combined:
$G(\mathbf{p}) = \sum_{l \in \mathcal{N}_{\mathbf{p}}} w_l(\mathbf{p}) \, T_l(\mathbf{p}) \, . \qquad (2)$
Here, $\mathcal{N}_{\mathbf{p}}$ is the set of approximately closest deformation nodes. The linear blending weights $w_l(\mathbf{p})$ for each position $\mathbf{p}$ are based on its distance to the respective deformation node [32]. Please refer to the supplemental document for more details.
The deformed mesh $\mathcal{M}'$ is obtained by applying the global space deformation $G$ to the canonical template mesh $\mathcal{M}$. The free parameters are the local per-node rotations $\mathbf{R}_l$ and translations $\mathbf{t}_l$, i.e., $6L$ parameters in total, with $L$ being the number of nodes in the graph. These parameters are the input to our deformation layer and are regressed by the graph convolutional decoder.
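The layer's forward pass is a direct application of Eqs. (1) and (2). The following NumPy sketch illustrates it; the function and variable names, the XYZ Euler convention, and the precomputed neighbor indices/weights are our assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def euler_to_rotmat(angles):
    # Builds a rotation matrix from three Euler angles (XYZ convention
    # assumed here for illustration).
    ax, ay, az = angles
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def embedded_deformation(verts, nodes, euler, trans, nbr_idx, weights):
    # verts:   (V, 3) canonical mesh vertices
    # nodes:   (L, 3) canonical graph node positions g_l
    # euler:   (L, 3) per-node Euler angles -> rotations R_l
    # trans:   (L, 3) per-node translations t_l
    # nbr_idx: (V, K) indices of the closest graph nodes per vertex
    # weights: (V, K) skinning weights w_l(p), each row sums to 1
    R = np.stack([euler_to_rotmat(a) for a in euler])       # (L, 3, 3)
    out = np.zeros_like(verts)
    for k in range(nbr_idx.shape[1]):
        j = nbr_idx[:, k]
        g, t = nodes[j], trans[j]
        # Eq. (1): T_l(p) = R_l (p - g_l) + g_l + t_l
        local = np.einsum('vij,vj->vi', R[j], verts - g) + g + t
        # Eq. (2): blend the per-node transforms with w_l(p)
        out += weights[:, k:k + 1] * local
    return out
```

With zero angles and zero translations the layer reproduces the canonical mesh, which is a convenient sanity check for any implementation.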
3.3 Differentiable Space Deformation
Our novel EDL is fully differentiable and can be used during network training to decouple the parameterization of the space deformation from the resolution of the final high-resolution output mesh. This enables us to define the reconstruction loss on the final high-resolution output mesh and to backpropagate the errors via the skinning transform to the coarse parameterization of the space deformation. Thus, our approach enables finding the best space deformation by only supervising the final output mesh.
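Because Eqs. (1) and (2) compose smooth operations, errors measured on the output vertices propagate to the coarse graph parameters. A minimal numerical sanity check under a simplified, hypothetical setup (a single node with skinning weight 1 and no rotation, so the deformed mesh is simply verts + t):

```python
import numpy as np

# Hypothetical simplified setup, not the full network: one deformation
# node, weight 1, identity rotation.
verts = np.array([[0., 0., 0.], [1., 0., 0.]])
target = verts + np.array([0.5, 0., 0.])

def loss(t):
    deformed = verts + t   # T(p) = p + t for a single node of weight 1
    return np.mean(np.sum((deformed - target) ** 2, axis=1))

t = np.zeros(3)
eps = 1e-6
# Central finite differences of the mesh loss w.r.t. the node translation.
numeric = np.array([(loss(t + eps * e) - loss(t - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
# Analytic gradient of the mean squared error w.r.t. t.
analytic = 2 * np.mean(verts + t - target, axis=0)
```

The two gradients agree, illustrating that supervision on the final mesh alone suffices to drive the deformation parameters.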
3.4 Spectral Graph Convolutions
Our graph encoder-decoder architecture is based on fast localized spectral filtering [7]. Given an $F_{\mathrm{in}}$-channel feature tensor $\mathbf{F} \in \mathbb{R}^{N \times F_{\mathrm{in}}}$, where the features are defined at the $N$ graph nodes, and letting $\mathbf{f}_i$ denote the $i$-th input graph feature map, we define the $o$-th output graph feature map as follows:

$\mathbf{f}'_o = \sum_{i=1}^{F_{\mathrm{in}}} g_{\theta_{i,o}}(\mathbf{L}) \, \mathbf{f}_i \, . \qquad (3)$

Here, $\mathbf{L}$ is the Laplacian matrix of the graph, and the filters $g_{\theta_{i,o}}$ are parameterized using Chebyshev polynomials of order $K$. This leads to localized filters that operate on the $K$-neighbourhoods of the nodes. The complete output feature tensor that stacks all feature maps is denoted as $\mathbf{F}' \in \mathbb{R}^{N \times F_{\mathrm{out}}}$. Each filter is parameterized by $K$ coefficients, which in total leads to $K \cdot F_{\mathrm{in}} \cdot F_{\mathrm{out}}$ trainable parameters for each graph convolution layer, see [7]
for more details. We apply the graph convolutions without stride,
i.e., input graph resolution equals output resolution.3.5 Training
We train our approach end-to-end in TensorFlow [1] using Adam [18]. As loss, we employ a dense geometric per-vertex loss with respect to the ground-truth mesh. For all experiments, we use a learning rate of and default parameters , , for Adam. We train for epochs on Dynamic Faust, epochs on SynHand5M, epochs on the CoMA dataset, and epochs on the Cloth dataset. We employ a batch size of .

3.6 Reconstructing Meshes from Images/Depth
The image/depth-to-mesh network consists of an image encoder and a mesh decoder, see Fig. 3. The mesh decoder is initialized from the corresponding mesh autoencoder, the image/depth encoder is based on a ResNet-50 [14] architecture, and the latent code is shared between the encoder and decoder. We initialize the ResNet-50 component using pre-trained weights from ImageNet [8]. To obtain training data, we render synthetic depth maps from the meshes. We train with the same settings as for mesh autoencoding.

3.7 Network Architecture Details
In the following, we provide more details of our encoderdecoder architectures.
Encoding Meshes.
Input to the first layer of our mesh encoder is an $N \times 3$ tensor that stacks the 3D coordinates of all $N$ mesh vertices.
We apply four downsampling modules. Each module applies a graph convolution and is followed by a downsampling to the next coarser level of the graph hierarchy. After each graph convolution, we apply a ReLU non-linearity. Finally, we take the output of the final module and apply a fully connected layer followed by a ReLU non-linearity to obtain the latent space embedding.
Encoding Images/Depth. To encode images/depth, we employ a 2D convolutional network that maps the color/depth input to a latent space embedding. Input to our encoder are images of resolution pixels. We modified the ResNet-50 [14] architecture to take a single- or three-channel input image. We furthermore added two additional convolution layers at the end, which are followed by global average pooling. Finally, a fully connected layer with a subsequent ReLU non-linearity maps the activations to the latent space.

Decoding Graphs. The task of the graph decoder is to map from the latent space back to the embedded deformation graph. First, we employ a fully connected layer in combination with reshaping to obtain the input to the graph convolutional upsampling modules. We apply a sequence of three or four upsampling modules until the resolution level of the embedded graph is reached. Each upsampling module first upsamples the features to the next finer graph resolution and then performs a spectral graph convolution (with ), which is then followed by a ReLU non-linearity. Then, we apply three additional graph convolutions, where we apply ReLUs after each of the first two. The latter two of these graph convolutions work with for local refinement. The resulting tensor is passed to our expert-designed embedded deformation layer.
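The spectral graph convolutions used throughout these modules (Sec. 3.4) can be sketched densely in NumPy. This is an illustrative re-implementation of the Chebyshev filtering of [7] with hypothetical names and a dense Laplacian for clarity; real implementations use sparse matrices:

```python
import numpy as np

def scaled_laplacian(A):
    # Normalized graph Laplacian, rescaled so its spectrum lies in
    # [-1, 1] for the Chebyshev recursion: L~ = 2 L / lambda_max - I.
    d = A.sum(axis=1)
    D = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(A.shape[0]) - D @ A @ D
    lmax = np.linalg.eigvalsh(L).max()
    return 2.0 * L / lmax - np.eye(A.shape[0])

def cheb_graph_conv(X, A, theta):
    # X:     (N, F_in) features on the graph nodes
    # A:     (N, N) adjacency matrix
    # theta: (K, F_in, F_out) Chebyshev filter coefficients
    # Recursion: T_0(X) = X, T_1(X) = L~ X, T_k = 2 L~ T_{k-1} - T_{k-2};
    # an order-K filter mixes each node with its (K-1)-hop neighbourhood.
    Lt = scaled_laplacian(A)
    K = theta.shape[0]
    Tx = [X, Lt @ X]
    for _ in range(2, K):
        Tx.append(2 * Lt @ Tx[-1] - Tx[-2])
    return sum(Tx[k] @ theta[k] for k in range(K))
```

Since the recursion only multiplies by the (sparse) Laplacian, evaluation is linear in the number of edges, which is what makes these filters practical on full-resolution meshes.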
4 Experiments
We evaluate DEMEA quantitatively and qualitatively on several challenging datasets and demonstrate state-of-the-art results for mesh autoencoding.
In Sec. 5, we show reconstruction from RGB images and depth maps and that the learned latent space enables wellbehaved interpolation.
Datasets.
We demonstrate the generality of DEMEA on experiments with body (Dynamic Faust, DFaust [4]), hand (SynHand5M [24]), textureless cloth (Cloth [3]), and face (CoMA [29]) datasets.
Dataset  Mesh  1st  2nd  3rd  4th
DFaust [4]  6890  1723  352  88  22
CoMA [29]  5023  2525  632  158  40
SynHand5M [24]  1193  400  100  25  7
Cloth [3]  961  256  100  36  16
Table 1 gives the number of graph nodes used on each level of our hierarchical encoderdecoder architecture.
All meshes live in metric space.
DFaust [4].
The training set consists of 28,294 meshes.
For the tests, we split off two identities (female 50004, male 50002) and two dynamic performances, i.e., oneleg jump and chicken wings.
Overall, this results in a test set with elements.
For the depthtomesh results, we found the synthetic depth maps from the DFaust training set to be insufficient for generalization, i.e., the test error was high.
Thus, we add more pose variety to DFaust for the depthtomesh experiments.
Specifically, we add randomly sampled poses from the CMU Mocap dataset (mocap.cs.cmu.edu) to the training data, where the identities are randomly sampled from the SMPL [23] model (14 female, 14 male).
We also add 12 such samples to the test set (6 female, 6 male).
Textureless Cloth [3].
For evaluating our approach on general nonrigidly deforming surfaces, we use the textureless cloth data set of Bednařík et al. [3].
It contains real depth maps and images of a white deformable sheet — observed in different states and differently shaded — as well as ground truth meshes.
In total, we select 3,861 meshes with consistent edge lengths.
3,167 meshes are used for training and meshes are reserved for evaluation.
For this dataset, we handdesign the entire graph hierarchy, since the canonical mesh is a perfectly flat sheet, which causes the downsampling method [11] to introduce severe artifacts.
SynHand5M [24].
For the experiments with hands, we take random meshes from the synthetic SynHand5M dataset of Malik et al. [24].
We render the corresponding depth maps.
The training set is comprised of meshes, and the remaining meshes are used for evaluation.
CoMA [29].
The training set contains 17,794 meshes of the human face in various expressions [29].
For tests, we select two challenging expressions, i.e., high smile and mouth extreme.
Thus, our test set contains 2,671 meshes in total.
4.1 Baseline Architectures
We compare our convolutional architecture with embedded deformation (DEMEA) to a number of strong baselines.
Convolutional Baseline.
We consider a version of our proposed architecture, the convolutional ablation (CA), where the expert-designed ED layer is replaced by learned upsampling modules that upsample to the mesh resolution.
In this case, the local refinement (convolutions with ) occurs on the level of the embedded graph.
We also consider modified CA (MCA), an architecture where the local graph convolutions are moved to the end of the network for local refinement on the mesh resolution.
FullyConnected Baseline.
We also consider an almost-linear baseline, the FC ablation (FCA).
The input is given to a fullyconnected layer, after which a ReLU is applied.
The resulting latent vector is decoded using another FC layer that maps to the output space.
Finally, we also consider an FCED network where the fullyconnected decoder maps to the deformation graph, which the embedded deformation layer (EDL) in turn maps to the fullresolution mesh.
4.2 Evaluation Settings
  FC IG  FC IM  GC IG  GC IM
GL  2.6  8.9  2.4  2.4
Ours  2.2  2.3  2.3  2.4
We first determine the most favorable input type and loss function for the considered architectures. As input, we consider either the full mesh (IM) or the subset of vertices that is used to define the embedded deformation graph (IG). In addition to our proposed loss function, we consider the graph loss (GL) with the reconstruction loss directly on the graph node positions (where the vertex positions of the input mesh that correspond to the graph nodes are used as ground truth). The GL setting uses the EDL only at test time to map to the full mesh, but not for training. We perform this evaluation for both fully-connected (FC) and graph-convolutional (GC) architectures and train for epochs.

Table 2 shows the quantitative results using the average per-vertex Euclidean error. Using the EDL during training leads to better quantitative results, as the network is aware of the skinning function and can move the graph nodes accordingly. Fully-connected networks perform worse with the mesh as input, perhaps due to the large number of parameters in the network. Graph convolutions by design perform local computations and thus require a much smaller number of free variables during training. In all further experiments, we use graphs as inputs for the fully-connected architectures and meshes as inputs for the convolutional architectures, to choose the strongest baselines. We always use the EDL during training in all further results.
4.3 Evaluations of the Autoencoder
Qualitative Evaluations.
Our architecture significantly outperforms the baselines qualitatively on the DFaust and SynHand5M datasets, as seen in Figs. 4 and 5.
Convolutional architectures without an embedded graph produce strong artifacts in the hand, feet, and face regions in the presence of large deformations. Since the EDL explicitly models deformations, we preserve fine details under strong non-linear deformations and articulations of the extremities.
Quantitative Evaluations.
We compare the proposed DEMEA to the baselines on the autoencoding task, see Table 3.
  DFaust [4]  SynHand5M [24]  Cloth [3]  CoMA [29]
  8  32  128  8  32  128  8  32  128  8  32  128
CA  6.7  3.0  2.6  10.30  4.49  3.76  1.61  0.90  0.72  1.57  0.97  0.87
MCA  8.6  3.4  2.5  9.33  4.55  3.67  1.75  0.82  0.70  1.61  0.99  0.87
Ours  6.6  2.9  2.4  8.97  4.67  3.53  1.34  0.83  0.71  1.49  1.05  0.94
FCA  9.3  3.4  2.2  20.96  7.22  1.44  1.71  0.71  0.44  3.19  1.39  0.75
FCED  8.2  3.1  2.2  20.49  9.13  1.60  1.89  0.69  0.44  3.61  3.19  1.08
While the fullyconnected baselines are competitive for larger dimensions of the latent space, their memory demand increases drastically. On the other hand, they perform significantly worse for low dimensions on all datasets. In this work, we are interested in low latent dimensions, e.g. less than 32, as we want to learn mesh representations that are as compact as possible. In the experiments, we predominantly evaluate with latent dimensions and .
For all datasets, DEMEA outperforms all considered baselines for latent codes of length .
For a latent dimension of , the gap in the accuracy shrinks.
With an increasing dimensionality of the latent code, competing architectures obtain comparable or better results quantitatively, perhaps because they become capable of fitting to the highfrequency details and noise. On the other hand, we are interested in capturing smooth large nonrigid deformations.
The baselines achieve comparable accuracy with different sizes of the latent space for different datasets.
In the case of the face dataset [29], the closest baselines are on par even for latent codes of size .
Since faces deform locally, is sufficient to capture the deformations around the mean shape using standard architectures.
As the latent code becomes larger, fullyconnected networks consistently outperform all convolutional architectures, as discussed before.
Comparisons.
In extensive comparisons with several competitive baselines, we have demonstrated the usefulness of our approach for autoencoding strong nonlinear deformations and articulated motion.
Next, we compare DEMEA to the existing stateoftheart CoMA approach [29].
We train their architecture on all mentioned datasets with a latent dimension of , which is also used in [29].
We outperform their method quantitatively on DFaust ( vs. ), on SynHand5M ( vs. ), and on Cloth ( vs. ). We perform worse on Faces ( vs. ), where the deformations are not large.
On the other datasets, the advantage of our explicit EDL formulation is clearly noticeable qualitatively.
In Fig. 4, we show that DEMEA avoids many of the artifacts present in the case of [29] and other baselines.
5 Applications
We show several applications of DEMEA, including imagetomesh reconstruction and deformation transfer.
5.1 RGB to Mesh
On the Cloth [3] dataset, we show that DEMEA can reconstruct meshes from RGB images. See Fig. 6 for qualitative examples using a latent dimension of .
On our test set, our proposed architecture achieves RGB-to-mesh reconstruction errors of , and for latent dimensions , and , respectively. Bednařík et al. [3], who use a different split than us, report an error of . Moreover, we asked the authors of the Hybrid Deformation Model Network (HDM-Net) [12] to train their method for the regression of textureless surfaces. On their split, HDM-Net achieves an error of after training for 100 epochs using a batch size of 4. Under the same settings, we retrain our approach without pre-training the mesh decoder. Our approach obtains test errors of , and using latent dimensions of , and , respectively.
5.2 Depth to Mesh
For hands and bodies, we demonstrate reconstruction results from single depth images.
Bodies.
We train networks with a small latent space dimension of and a larger dimension of .
Quantitatively, we obtain errors of and with latent space dimensions of and , respectively, on unaugmented synthetic data.
Besides, we also apply our approach to real data, see Fig. 7.
To this end, we found it necessary to augment the depth images with artificial noise to lessen the domain gap.
Video results are included in the supplementary.
Hands.
DEMEA can reconstruct hands from depth as well, see Fig. 8.
5.3 Latent Space Arithmetic
Although we do not employ any regularization on the latent space, we found empirically that the network learns a well-behaved latent space.
As we show in the supplemental document and video, this allows DEMEA to temporally smooth tracked meshes from a depth stream.
Latent Interpolation.
We can linearly interpolate the latent vectors $\mathbf{z}_s$ and $\mathbf{z}_t$ of a source and a target mesh: $\mathbf{z}(\alpha) = (1 - \alpha)\,\mathbf{z}_s + \alpha\,\mathbf{z}_t$, with $\alpha \in [0, 1]$.
Even for highly different poses and identities, decoding these interpolated latent vectors yields plausible in-between meshes, see Fig. 9.
Deformation Transfer. Furthermore, the learned latent space even allows transferring poses between different identities on DFaust. Let a sequence of source meshes $\mathcal{S} = (\mathcal{S}_1, \ldots, \mathcal{S}_n)$ of person $A$ and a target mesh $\mathcal{T}_1$ of person $B$ be given, where w.l.o.g. $\mathcal{S}_1$ and $\mathcal{T}_1$ correspond to the same pose. We now seek a sequence of target meshes $(\mathcal{T}_1, \ldots, \mathcal{T}_n)$ of person $B$ performing the same poses as person $A$ in $\mathcal{S}$. We encode $\mathcal{S}_i$ and $\mathcal{T}_1$ into the latent space of the mesh autoencoder, yielding the corresponding latent vectors $\mathbf{s}_i$ and $\mathbf{t}_1$. We define the identity difference $\mathbf{d} = \mathbf{t}_1 - \mathbf{s}_1$ and set $\mathbf{t}_i = \mathbf{s}_i + \mathbf{d}$ for $i > 1$. Decoding $\mathbf{t}_i$ using the mesh decoder then yields $\mathcal{T}_i$. We show qualitative results in Fig. 10 and in the supplementary.
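The latent-space arithmetic above reduces to a few lines. The following sketch (hypothetical function names, toy latent codes) illustrates both interpolation and identity-difference pose transfer on encoded vectors:

```python
import numpy as np

def interpolate(z_s, z_t, alpha):
    # Linear interpolation between source and target latent codes.
    return (1 - alpha) * z_s + alpha * z_t

def transfer_poses(z_src_seq, z_tgt_first):
    # Deformation transfer in latent space: the identity difference
    # d = t_1 - s_1 (same pose, different identity) is added to every
    # latent code of the source sequence before decoding.
    d = z_tgt_first - z_src_seq[0]
    return z_src_seq + d

# Toy latent codes (dimension 3 for illustration only).
s = np.array([[0., 0., 0.], [1., 0., 0.], [2., 0., 0.]])   # person A sequence
t1 = np.array([0., 1., 0.])                                # person B, pose 1
t_seq = transfer_poses(s, t1)
```

Decoding each transferred code with the mesh decoder would then produce the target sequence; the arithmetic itself involves no network evaluation.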
6 Limitations
While the embedded deformation graph excels at highly articulated, non-rigid motions, it has difficulties accounting for very subtle actions. Since the faces in the CoMA [29] dataset do not undergo large deformations, our EDL-based architecture does not offer a significant advantage there. Like all other 3D deep learning techniques, our approach requires reasonably sized mesh datasets for supervised training, which might be difficult to capture or model. We train our network in an object-specific manner. Generalizing our approach across different object categories is an interesting direction for future work.
7 Conclusion
We proposed DEMEA — the first deep mesh autoencoder for highly deformable and articulated scenes, such as human bodies, hands, and deformable surfaces, that builds on a new differentiable embedded deformation layer. The deformation layer reasons about local rigidity of the mesh and allows us to achieve higher quality autoencoding results compared to several baselines and existing approaches. We have shown multiple applications of our architecture including nonrigid reconstruction from real depth maps and 3D reconstruction of textureless surfaces from images.
Acknowledgments. This work was supported by the ERC Consolidator Grant 4DReply (770784), the Max Planck Center for Visual Computing and Communications (MPCVCC), and an Oculus research grant.
References

[1]
M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado,
A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving,
M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg,
D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens,
B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan,
F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and
X. Zheng.
TensorFlow: Largescale machine learning on heterogeneous systems, 2015.
Software available from tensorflow.org.  [2] T. Bagautdinov, C. Wu, J. Saragih, Y. Sheikh, and P. Fua. Modeling facial geometry using compositional vaes. 2018.
 [3] J. Bednařík, P. Fua, and M. Salzmann. Learning to reconstruct textureless deformable surfaces. In International Conference on 3D Vision (3DV), 2018.

[4] F. Bogo, J. Romero, G. Pons-Moll, and M. J. Black. Dynamic FAUST: Registering human bodies in motion. In Computer Vision and Pattern Recognition (CVPR), 2017.
 [5] D. Boscaini, J. Masci, E. Rodolà, and M. Bronstein. Learning shape correspondence with anisotropic convolutional neural networks. In International Conference on Neural Information Processing Systems (NIPS), pages 3197–3205, 2016.
 [6] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. CoRR, abs/1312.6203, 2013.
 [7] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In International Conference on Neural Information Processing Systems (NIPS), NIPS’16, pages 3844–3852, 2016.
 [8] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei. ImageNet: A LargeScale Hierarchical Image Database. In CVPR09, 2009.
 [9] D. FuentesJimenez, D. CasillasPerez, D. Pizarro, T. Collins, and A. Bartoli. Deep ShapefromTemplate: WideBaseline, Dense and Fast Registration and Deformable Reconstruction from a Single Image. arXiv eprints, 2018.
 [10] L. Gao, J. Yang, Y.L. Qiao, Y.K. Lai, P. L. Rosin, W. Xu, and S. Xia. Automatic unpaired shape deformation transfer. ACM Trans. Graph., 37(6):237:1–237:15, 2018.
 [11] M. Garland and P. S. Heckbert. Surface simplification using quadric error metrics. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’97, pages 209–216, New York, NY, USA, 1997. ACM Press/AddisonWesley Publishing Co.
 [12] V. Golyanik, S. Shimada, K. Varanasi, and D. Stricker. Hdmnet: Monocular nonrigid 3d reconstruction with learned deformation model. In International Conference on Virtual Reality and Augmented Reality (EuroVR), pages 51–72, 2018.
 [13] T. Groueix, M. Fisher, V. G. Kim, B. Russell, and M. Aubry. AtlasNet: A PapierMâché Approach to Learning 3D Surface Generation. In Computer Vision and Pattern Recognition (CVPR), 2018.
 [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 2730, 2016, pages 770–778, 2016.
 [15] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
 [16] D. Jack, J. K. Pontes, S. Sridharan, C. Fookes, S. Shirazi, F. Maire, and A. Eriksson. Learning freeform deformations for 3d object reconstruction. In Asian Conference on Computer Vision (ACCV), 2018.
 [17] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik. Learning categoryspecific mesh reconstruction from image collections. In European Conference on Computer Vision (ECCV), 2018.
 [18] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
 [19] A. Kurenkov, J. Ji, A. Garg, V. Mehta, J. Gwak, C. Choy, and S. Savarese. Deformnet: Freeform deformation network for 3d shape reconstruction from a single image. In Winter Conference on Applications of Computer Vision, 2018.
 [20] H. Li, B. Adams, L. J. Guibas, and M. Pauly. Robust singleview geometry and motion reconstruction. In ACM SIGGRAPH Asia, pages 175:1–175:10, 2009.
 [21] O. Litany, A. Bronstein, M. Bronstein, and A. Makadia. Deformable shape completion with graph convolutional autoencoders. In Computer Vision and Pattern Recognition (CVPR), 2018.
 [22] M. Loper, N. Mahmood, and M. J. Black. Mosh: Motion and shape capture from sparse markers. ACM Trans. Graph., 33(6):220:1–220:13, 2014.
 [23] M. Loper, N. Mahmood, J. Romero, G. PonsMoll, and M. J. Black. Smpl: A skinned multiperson linear model. ACM Trans. Graph., 34(6):248:1–248:16, 2015.
 [24] J. Malik, A. Elhayek, F. Nunnari, K. Varanasi, K. Tamaddon, A. Héloir, and D. Stricker. Deephps: Endtoend estimation of 3d hand pose and shape by learning from synthetic depth. 2018 International Conference on 3D Vision (3DV), pages 110–119, 2018.
 [25] J. Masci, D. Boscaini, M. M. Bronstein, and P. Vandergheynst. Geodesic convolutional neural networks on riemannian manifolds. In International Conference on Computer Vision Workshop (ICCVW), pages 832–840, 2015.
 [26] F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. pages 5425–5434, 07 2017.
 [27] M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. In International Conference on Machine Learning (ICML), volume 48, pages 2014–2023, 2016.
 [28] A. Pumarola, A. Agudo, L. Porzi, A. Sanfeliu, V. Lepetit, and F. MorenoNoguer. Geometryaware network for nonrigid shape prediction from a single view. In Computer Vision and Pattern Recognition (CVPR), 2018.
 [29] A. Ranjan, T. Bolkart, S. Sanyal, and M. J. Black. Generating 3D faces using convolutional mesh autoencoders. In European Conference on Computer Vision (ECCV), pages 725–741, 2018.
 [30] A. Sinha, A. Unmesh, Q. Huang, and K. Ramani. Surfnet: Generating 3d shape surfaces using deep residual networks. In Computer Vision and Pattern Recognition (CVPR), 2017.
 [31] O. Sorkine and M. Alexa. Asrigidaspossible surface modeling. In Eurographics Symposium on Geometry Processing (SGP), pages 109–116, 2007.
 [32] R. W. Sumner, J. Schmid, and M. Pauly. Embedded deformation for shape manipulation. In ACM SIGGRAPH, 2007.
 [33] Q. Tan, L. Gao, Y.K. Lai, and S. Xia. Variational autoencoders for deforming 3d mesh models. In Computer Vision and Pattern Recognition (CVPR), 2018.
 [34] Q. Tan, L. Gao, Y.K. Lai, J. Yang, and S. Xia. Meshbased autoencoders for localized deformation component analysis. In AAAI, 2018.
 [35] N. Verma, E. Boyer, and J. Verbeek. FeaStNet: FeatureSteered Graph Convolutions for 3D Shape Analysis. In Computer Vision and Pattern Recognition (CVPR), pages 2598–2606, 2018.
 [36] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.G. Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. In European Conference on Computer Vision (ECCV), 2018.
 [37] L. Yi, H. Su, X. Guo, and L. Guibas. Syncspeccnn: Synchronized spectral cnn for 3d shape segmentation. In Computer Vision and Pattern Recognition (CVPR), pages 6584–6592, 2017.
 [38] M. Zollhöfer, M. Nießner, S. Izadi, C. Rhemann, C. Zach, M. Fisher, C. Wu, A. Fitzgibbon, C. Loop, C. Theobalt, and M. Stamminger. Realtime nonrigid reconstruction using an rgbd camera. ACM Transactions on Graphics (TOG), 33(4), 2014.