DEMEA: Deep Mesh Autoencoders for Non-Rigidly Deforming Objects

05/24/2019, by Edgar Tretschk, et al.

Mesh autoencoders are commonly used for dimensionality reduction, sampling and mesh modeling. We propose a general-purpose DEep MEsh Autoencoder (DEMEA) which adds a novel embedded deformation layer to a graph-convolutional mesh autoencoder. The embedded deformation layer (EDL) is a differentiable deformable geometric proxy which explicitly models point displacements of non-rigid deformations in a lower dimensional space and serves as a local rigidity regularizer. DEMEA decouples the parameterization of the deformation from the final mesh resolution since the deformation is defined over a lower dimensional embedded deformation graph. We perform a large-scale study on four different datasets of deformable objects. Reasoning about the local rigidity of meshes using EDL allows us to achieve higher-quality results for highly deformable objects, compared to directly regressing vertex positions. We demonstrate multiple applications of DEMEA, including non-rigid 3D reconstruction from depth and shading cues, non-rigid surface tracking, as well as the transfer of deformations over different meshes.







1 Introduction

With the increasing volume of datasets of deforming objects enabled by modern 3D acquisition technology, the demand for compact data representations and compression grows. Dimensionality reduction of mesh data has multiple applications in computer graphics and vision, including shape retrieval, generation, interpolation, and completion, among others. Recently, deep convolutional autoencoder networks were shown to be able to produce compact mesh representations [2, 34, 29].

Dynamic real-world objects do not deform arbitrarily. While deforming, they preserve topology, and nearby points are more likely to deform similarly compared to more distant points. Current convolutional mesh autoencoders exploit this coherence by learning the deformation properties of objects directly from data and are already suitable for mesh compression and representation learning. On the other hand, they do not explicitly reason about the deformation field in terms of local rotations and translations. We show that explicitly reasoning about the local rigidity of meshes enables higher-quality results for highly deformable objects, compared to directly regressing vertex positions.

At the other end of the spectrum, mesh manipulation techniques such as As-Rigid-As-Possible Deformation [31] and Embedded Deformation [32] only require a single mesh and enforce deformation properties, such as smoothness and local rigidity, based on a set of hand-crafted priors. These hand-crafted priors are effective and work surprisingly well, but since they do not model the real-world deformation behavior of the physical object, they often lead to unrealistic deformations and artifacts in the reconstructions.

In this paper, we propose a general-purpose mesh autoencoder with a model-based deformation layer, combining the best of both worlds, i.e., supervised learning with deformable meshes and a novel differentiable embedded deformation layer that models the deformable meshes using lower-dimensional deformation graphs with physically interpretable deformation parameters. While the core of our DEep MEsh Autoencoder (DEMEA) learns the deformation model of objects from data using the state-of-the-art convolutional mesh autoencoder (CoMA) [29], the novel embedded deformation layer decouples the parameterization of object motion from the mesh resolution and introduces local spatial coherence via vertex skinning.

DEMEA is trained on mesh datasets of moderate size that have recently become available [22, 4, 3, 24]. It is a general mesh autoencoding approach that can be trained for any deformable object class. We evaluate our approach on datasets of three object classes with large deformations, i.e., articulated motion (body, hand) and large non-linear deformations (cloth), and one object class with small localized deformations (face). Quantitatively, DEMEA outperforms standard convolutional mesh autoencoder architectures in terms of the vertex-to-vertex distance error. Qualitatively, we show that DEMEA produces results of visually higher fidelity due to the physically based embedded deformation layer.

We show several applications of DEMEA in computer vision and graphics. Once trained, the decoder of our autoencoders can be used for shape compression, high-quality depth-to-mesh reconstruction of human bodies and hands, and even poorly textured RGB-image-to-mesh reconstruction for deforming cloth. The low-dimensional latent space learned by our approach is meaningful and well-behaved, which we demonstrate by linearly interpolating between the latent codes of different meshes. Thus, DEMEA provides us with a well-behaved, general-purpose, category-specific generative model of highly deformable objects.

2 Related Work

Mesh Manipulation and Tracking. Our embedded deformation layer is inspired by as-rigid-as-possible modelling [31] and the method of Sumner et al. [32] for mesh editing and manipulation. While these methods have been shown to be very useful for mesh manipulation in computer graphics, to the best of our knowledge, this is the first time a model-based regularizer is used in a mesh autoencoder.

Using a template for non-rigid object tracking from depth maps was extensively studied in the model-based setting [20, 38]. Recently, Litany et al. [21] demonstrated a neural network-based approach for the completion of human body shapes from a single depth map.

Graph Convolutions. The encoder-decoder approach to dimensionality reduction with neural networks (NNs) for images was introduced in [15]. With deeper architectures encompassing a large number of parameters, learning can be performed on large data structures, including deformable meshes. Deep convolutional neural networks (CNNs) effectively capture contextual information of input data modalities and can be trained for various tasks. Lately, convolutions operating on regular grids have been generalized to topologically connected structures such as meshes and two-dimensional manifolds [6, 27], enabling the learning of correspondences between shapes, shape retrieval [25, 5, 26], and segmentation [37].

Masci et al. [25] proposed geodesic CNNs operating on Riemannian manifolds for shape description, retrieval, and correspondence estimation. Boscaini et al. [5] introduced spatial weighting functions based on simulated heat propagation and projected anisotropic convolutions. Monti et al. [26] extended graph convolutions to variable patches through Gaussian mixture model CNNs. In FeaSTNet [35], the correspondences between filter weights and graph neighborhoods with arbitrary connectivities are established dynamically from the learned features. The localized spectral interpretation of Defferrard et al. [7] is based on recursive feature learning with Chebyshev polynomials and has linear evaluation complexity.
Learning Mesh-Based 3D Autoencoders. Very recently, several mesh autoencoders with various applications were proposed. A hierarchical variational mesh autoencoder with fully connected layers for facial geometry parameterization learns an accurate face model from small databases and accomplishes depth-to-mesh fitting tasks [2]. Tan and coworkers [33] introduced a mesh autoencoder with a rotation-invariant mesh representation as a generative model. Their network can generate new meshes by sampling in the latent space and perform mesh interpolation. To cope with meshes of arbitrary connectivity, they use fully-connected layers and do not explicitly encode neighbor relations. Tan et al. [34] train a network with graph convolutions to extract sparse localized deformation components from meshes. Their method is suitable for large-scale deformations and meshes with irregular connectivity. Gao et al. [10] transfer mesh deformations by training a generative adversarial network with a cycle consistency loss to map shapes in the latent space, while a variational mesh autoencoder encodes deformations. The convolutional mesh autoencoder (CoMA) of Ranjan et al. [29] can model and sample stronger deformations than previous methods and supports asymmetric facial expressions.

Our DEMEA is a general-purpose mesh autoencoder that can be used for shape completion, shape interpolation, and even surface reconstruction from monocular images using shading cues. Similar to CoMA [29], DEMEA uses spectral graph convolutions but additionally employs the embedded deformation layer as a model-based regularizer. While most of these approaches show results only on a single object category, we demonstrate the usefulness of our approach through evaluations on three datasets of highly deformable objects. We believe it is essential for the architecture to account for the point relationships of the mesh data when the connectivity is available.
Learning 3D Reconstruction. Several supervised methods reconstruct rigid objects in 3D. Given a depth image, the network of Sinha et al. [30] reconstructs the observed surface of non-rigid objects. In its 3D reconstruction mode, their method reconstructs rigid objects from single images. Similarly, Groueix et al. [13] reconstruct object surfaces from a point cloud or single monocular image with an atlas parameterization. The approaches of Kurenkov et al. [19] and Jack et al. [16] deform a predefined object-class template to match the observed object appearance in an image. Similarly, Kanazawa et al. [17] deform a template to match the object appearance but additionally support object texture. The Pixel2Mesh approach of Wang et al. [36] reconstructs an accurate mesh of an object in a segmented image. Initializing the 3D reconstruction with an ellipsoid, they gradually deform it until the appearance matches the observation. The template-based approaches [19, 16, 17], as well as Pixel2Mesh [36], produce complete 3D meshes.
Learning Monocular Non-Rigid Surface Regression. Only a few supervised learning approaches for 3D reconstruction from monocular images tackle the deformable nature of non-rigid objects. Pumarola et al. [28] and Golyanik et al. [12] train networks for deformation models with synthetic thin plates datasets. Their methods can infer non-rigid states of the observed surfaces such as paper sheets or membranes. The accuracy and robustness of both methods on real images are limited. Bednařík et al. [3] propose an encoder-decoder network for texture-less surfaces relying on shading cues. They train on a real dataset and show an enhanced reconstruction accuracy on real images, but support only trained object classes. Fuentes-Jimenez et al. [9] train a network to deform an object template for depth map recovery. They achieve impressive results on real image sequences but require an accurate 3D model of every object in the scene, which restricts the method’s practicality.

One of the applications of DEMEA is the recovery of texture-less surfaces from RGB images. Since depth maps are, as a data modality, close to images of shaded surfaces, we train DEMEA in depth-to-mesh mode on RGB images instead of depth maps. As a result, we can regress surface geometry from shading cues.

3 Approach

In this section, we describe the architecture of the proposed DEMEA. We employ an expert-designed embedded deformation layer to decouple the complexity of the learned deformation field from the actual mesh resolution. The deformation is represented relative to a canonical mesh with a fixed set of vertices and edges. To this end, we define the encoder-decoder on a coarse deformation graph and use the embedded deformation layer to drive the deformation of the final high-resolution mesh, see Fig. 1. Our architecture is based on spectral graph convolutions that are defined on a multi-resolution graph hierarchy. In the following, we describe all components in more detail.

3.1 Mesh Hierarchy

The up- and down-sampling in the convolutional mesh autoencoder is defined over a multi-resolution mesh hierarchy, similar to the CoMA [29] architecture. We compute the mesh hierarchy fully automatically based on quadric edge collapses [11], i.e., each hierarchy level is a simplified version of the input mesh. We employ a hierarchy with five resolution levels, where the finest level is the mesh. Given the multi-resolution graph hierarchy, we define up- and down-sampling operations [29] for feature maps defined on the graph. To this end, during down-sampling, we enforce the nodes of the coarser level to be a subset of the nodes of the next finer level. We transfer a feature map to the next coarser level by a similar sub-sampling operation. The inverse operation, i.e., feature map up-sampling, is implemented based on a barycentric interpolation of close features. During edge collapse, we project each collapsed node onto the closest triangle of the coarser level. We use the barycentric coordinates of this closest point with respect to the triangle’s vertices to define the interpolation weights.
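The up- and down-sampling operations described above can be sketched as follows; the function names, array layouts, and the dense (non-sparse) formulation are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def upsample_features(fine_from_coarse, bary, coarse_feats):
    """Up-sample per-node features from a coarse graph level to a finer one.

    fine_from_coarse: (F, 3) int array; for each fine-level node, the indices
        of the three coarse-level triangle vertices it was projected onto.
    bary: (F, 3) float array of barycentric weights w.r.t. those vertices.
    coarse_feats: (C, d) feature map defined on the coarse level.
    Returns an (F, d) feature map on the finer level.
    """
    # Each fine feature is a barycentric combination of three coarse features.
    return np.einsum('fk,fkd->fd', bary, coarse_feats[fine_from_coarse])

def downsample_features(keep_idx, fine_feats):
    """Down-sample by keeping only the subset of nodes that survives the
    quadric edge collapse (coarse nodes are a subset of the finer level)."""
    return fine_feats[keep_idx]
```

Nodes that survive on the coarser level can be handled by this same scheme with a barycentric weight of 1 on themselves.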

3.2 Embedded Deformation Layer (EDL)

Given a canonical mesh for an object category, we design a corresponding coarse embedded deformation graph, see Fig. 2.

Figure 2: Template meshes and the corresponding embedded deformation graphs.

The deformation graph is used as one of the two levels immediately below the mesh in the mesh hierarchy (depending on the resolution of the graph) of the autoencoder. As the quadric edge collapse algorithm can delete nodes of the embedded graph when computing intermediate levels of the graph hierarchy, we modify the algorithm to ensure that the nodes of the embedded graph are not removed from finer levels. The number of nodes in the deformation graph is kept significantly lower than the mesh resolution, and highly deformable regions (arms and legs in the case of bodies) are assigned relatively more nodes.

Our embedded deformation layer models a space deformation that maps the vertices of the canonical template mesh to a deformed version. Following Sumner et al. [32], the embedded deformation graph consists of canonical node positions p_j and the edges between them, where the number of graph nodes is much smaller than the number of mesh vertices. The global space deformation is defined by a set of local, rigid, per-graph-node transformations. Each local rigid space transformation is defined by a tuple (R_j, t_j), with R_j being a 3x3 rotation matrix and t_j being a 3D translation vector. We enforce that R_j is a valid rotation, i.e., R_j^T R_j = I and det(R_j) = 1, by parameterizing the rotation matrices based on three Euler angles. Each transformation is anchored at the canonical node position p_j and maps every point v to a new position in the following manner [32]:

    g_j(v) = R_j (v − p_j) + p_j + t_j .
To obtain the final global space deformation G, the local per-node transformations are linearly combined:

    G(v) = Σ_{j ∈ S(v)} w_j(v) · ( R_j (v − p_j) + p_j + t_j ) .
Here, S(v) is the set of approximate closest deformation nodes of v. The linear blending weights w_j(v) for each position are based on the distance to the respective deformation node [32]. Please refer to the supplemental document for more details.

The deformed mesh is obtained by applying the global space deformation G to the canonical template mesh. The free parameters are the local per-node rotations and translations, i.e., 6N parameters (three Euler angles and a 3D translation per node), with N being the number of nodes in the graph. These parameters are input to our deformation layer and are regressed by the graph convolutional decoder.
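As an illustration of the skinning performed by the EDL, the following sketch applies per-node rigid transformations with linear blending. All names, the Euler-angle convention, and the explicit per-vertex loop are our own assumptions for clarity, not the paper's exact implementation:

```python
import numpy as np

def euler_to_rot(angles):
    """Rotation matrix from three Euler angles (x-y-z convention assumed)."""
    ax, ay, az = angles
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def embedded_deformation(verts, nodes, angles, trans, weights, nbrs):
    """Apply embedded-deformation skinning in the style of Sumner et al.

    verts:   (V, 3) canonical template vertices.
    nodes:   (N, 3) canonical graph node positions p_j.
    angles:  (N, 3) per-node Euler angles (regressed by the decoder).
    trans:   (N, 3) per-node translations t_j.
    weights: (V, K) blending weights over each vertex's K closest nodes.
    nbrs:    (V, K) indices of those closest nodes.
    Returns the (V, 3) deformed vertices.
    """
    Rs = np.stack([euler_to_rot(a) for a in angles])  # (N, 3, 3)
    out = np.zeros_like(verts)
    for i, v in enumerate(verts):
        for w, j in zip(weights[i], nbrs[i]):
            # Local rigid transform anchored at node j, blended linearly.
            out[i] += w * (Rs[j] @ (v - nodes[j]) + nodes[j] + trans[j])
    return out
```

Since every operation above is differentiable in the angles and translations, the same computation can be expressed in an autodiff framework to obtain the trainable layer.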

3.3 Differentiable Space Deformation

Our novel EDL is fully differentiable and can be used during network training to decouple the parameterization of the space deformation from the resolution of the final high-resolution output mesh. This enables us to define the reconstruction loss on the final high-resolution output mesh and backpropagate the errors via the skinning transform to the coarse parameterization of the space deformation. Thus, our approach enables finding the best space deformation by only supervising the final output mesh.

3.4 Spectral Graph Convolutions

Our graph encoder-decoder architecture is based on fast localized spectral filtering [7]. Given a multi-channel input feature tensor whose features are defined at the graph nodes, let x_i denote the i-th input graph feature map; we define the j-th output graph feature map y_j as follows:

    y_j = Σ_i g_{θ_{i,j}}(L) x_i .
Here, L is the Laplacian matrix of the graph, and the filters g_{θ_{i,j}} are parameterized using Chebyshev polynomials of order K. This leads to K-localized filters that operate on the K-neighbourhoods of the nodes. The complete output feature tensor, which stacks all output feature maps y_j, is the layer's result. Each filter is parameterized by K coefficients, i.e., each graph convolution layer learns one set of K Chebyshev coefficients per pair of input and output channels; see [7] for more details. We apply the graph convolutions without stride, i.e., the input graph resolution equals the output resolution.
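A minimal dense sketch of the Chebyshev filtering of Defferrard et al. [7] described above; the tensor layout and function signature are our own illustrative choices (real implementations use sparse Laplacians):

```python
import numpy as np

def cheb_graph_conv(L, X, theta):
    """Chebyshev spectral graph convolution.

    L:     (V, V) graph Laplacian, assumed rescaled so its spectrum lies in
           [-1, 1], i.e. L_scaled = 2 L / lambda_max - I.
    X:     (V, F_in) input feature map defined on the graph nodes.
    theta: (K, F_in, F_out) Chebyshev coefficients of order K.
    Returns the (V, F_out) output feature map; the filter is K-localized.
    """
    K = theta.shape[0]
    Tx = [X]                      # T_0(L) X = X
    if K > 1:
        Tx.append(L @ X)          # T_1(L) X = L X
    for k in range(2, K):
        # Chebyshev recurrence: T_k(L) X = 2 L T_{k-1}(L) X - T_{k-2}(L) X
        Tx.append(2 * (L @ Tx[-1]) - Tx[-2])
    # Sum over polynomial orders; mix input channels into output channels.
    return sum(Tx[k] @ theta[k] for k in range(K))
```

Each term costs one sparse matrix-vector product, which gives the linear evaluation complexity mentioned above.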

3.5 Training

We train our approach end-to-end in TensorFlow [1] using Adam [18]. As loss, we employ a dense geometric per-vertex loss with respect to the ground-truth mesh. For all experiments, we use the same learning rate and the default parameters for Adam. The number of training epochs is chosen per dataset (Dynamic Faust, SynHand5M, CoMA, and Cloth), and we employ a fixed batch size.
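As a sketch, such a dense geometric per-vertex loss could look as follows; since the norm is not stated here, we assume an l1 loss purely for illustration:

```python
import numpy as np

def per_vertex_loss(pred_verts, gt_verts):
    """Dense geometric per-vertex loss against the ground-truth mesh.

    pred_verts, gt_verts: (V, 3) arrays of predicted and ground-truth vertex
    positions. We assume a mean absolute (l1) difference here; the actual
    norm used by the paper may differ.
    """
    return np.abs(pred_verts - gt_verts).mean()
```

Because the EDL is differentiable, this loss on the full-resolution mesh backpropagates through the skinning to the graph-node parameters.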

3.6 Reconstructing Meshes from Images/Depth

Figure 3: Image/depth-to-mesh pipeline: To train an image/depth-to-mesh reconstruction network, we employ a convolutional image encoder and initialize the decoder to a pre-trained graph decoder.

The image/depth-to-mesh network consists of an image encoder and a mesh decoder, see Fig. 3. The mesh decoder is initialized from the corresponding mesh auto-encoder, the image/depth encoder is based on a ResNet-50 [14] architecture, and the latent code is shared between the encoder and decoder. We initialize the ResNet-50 component using pre-trained weights from ImageNet [8]. To obtain training data, we render synthetic depth maps from the meshes. We train with the same settings as for mesh auto-encoding.

3.7 Network Architecture Details

In the following, we provide more details of our encoder-decoder architectures.
Encoding Meshes. Input to the first layer of our mesh encoder is a tensor that stacks the coordinates of all vertices. We apply four down-sampling modules. Each module applies a graph convolution and is followed by a down-sampling to the next coarser level of the graph hierarchy. After each graph convolution, we apply a ReLU non-linearity. Finally, we take the output of the final module and apply a fully connected layer followed by a ReLU non-linearity to obtain a latent space embedding.

Encoding Images/Depth. To encode images/depth, we employ a 2D convolutional network to map the color/depth input to a latent space embedding. Input to our encoder are images of fixed resolution. We modified the ResNet-50 [14] architecture to take a single-channel or three-channel input image. We furthermore added two additional convolution layers at the end, which are followed by global average pooling. Finally, a fully connected layer with a subsequent ReLU non-linearity maps the activations to the latent space.

Decoding Graphs. The task of the graph decoder is to map from the latent space back to the embedded deformation graph. First, we employ a fully connected layer in combination with reshaping to obtain the input to the graph convolutional up-sampling modules. We apply a sequence of three or four up-sampling modules until the resolution level of the embedded graph is reached. Each up-sampling module first up-samples the features to the next finer graph resolution and then performs a spectral graph convolution, which is followed by a ReLU non-linearity. Then, we apply three additional graph convolutions, where ReLUs follow each of the first two. The latter two of these graph convolutions serve for local refinement. The resulting tensor is passed to our expert-designed embedded deformation layer.

4 Experiments

We evaluate DEMEA quantitatively and qualitatively on several challenging datasets and demonstrate state-of-the-art results for mesh auto-encoding. In Sec. 5, we show reconstruction from RGB images and depth maps and that the learned latent space enables well-behaved interpolation.
Datasets. We demonstrate the generality of DEMEA on experiments with body (Dynamic Faust, DFaust [4]), hand (SynHand5M [24]), textureless cloth (Cloth [3]), and face (CoMA [29]) datasets.

Dataset          Mesh   1st    2nd   3rd   4th
DFaust [4]       6890   1723   352   88    22
CoMA [29]        5023   2525   632   158   40
SynHand5M [24]   1193   400    100   25    7
Cloth [3]        961    256    100   36    16
Table 1: Number of vertices on each level of the graph hierarchy. Bold levels denote the embedded graph.

Table 1 gives the number of graph nodes used on each level of our hierarchical encoder-decoder architecture. All meshes live in metric space.
DFaust [4]. The training set consists of 28,294 meshes. For the tests, we split off two identities (female 50004, male 50002) and two dynamic performances, i.e., one-leg jump and chicken wings. Overall, this results in a test set with elements. For the depth-to-mesh results, we found the synthetic depth maps from the DFaust training set to be insufficient for generalization, i.e., the test error was high. Thus, we add more pose variety to DFaust for the depth-to-mesh experiments. Specifically, we add randomly sampled poses from the CMU dataset to the training data, where the identities are randomly sampled from the SMPL [23] model (14 female, 14 male). We also add 12 such samples to the test set (6 female, 6 male).
Textureless Cloth [3]. For evaluating our approach on general non-rigidly deforming surfaces, we use the textureless cloth data set of Bednařík et al. [3]. It contains real depth maps and images of a white deformable sheet — observed in different states and differently shaded — as well as ground truth meshes. In total, we select 3,861 meshes with consistent edge lengths. 3,167 meshes are used for training and meshes are reserved for evaluation. For this dataset, we hand-design the entire graph hierarchy, since the canonical mesh is a perfectly flat sheet, which causes the down-sampling method [11] to introduce severe artifacts.
SynHand5M [24]. For the experiments with hands, we take random meshes from the synthetic SynHand5M dataset of Malik et al. [24]. We render the corresponding depth maps. The training set is comprised of meshes, and the remaining meshes are used for evaluation.
CoMA [29]. The training set contains 17,794 meshes of the human face in various expressions [29]. For tests, we select two challenging expressions, i.e., high smile and mouth extreme. Thus, our test set contains 2,671 meshes in total.

4.1 Baseline Architectures

We compare our convolutional architecture with embedded deformation (DEMEA) to a number of strong baselines.
Convolutional Baseline. We consider a version of our proposed architecture, convolutional ablation (CA), where the expert-designed ED layer is replaced by learned upsampling modules that upsample to the mesh resolution. In this case, the local refinement (convolutions with ) occurs on the level of the embedded graph. We also consider modified CA (MCA), an architecture where the local graph convolutions are moved to the end of the network for local refinement on the mesh resolution.
Fully-Connected Baseline. We also consider an almost-linear baseline, FC ablation (FCA). The input is given to a fully-connected layer, after which a ReLU is applied. The resulting latent vector is decoded using another FC layer that maps to the output space. Finally, we also consider an FCED network where the fully-connected decoder maps to the deformation graph, which the embedded deformation layer (EDL) in turn maps to the full-resolution mesh.
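The FCA baseline described above can be sketched as follows; the explicit weight matrices and the omission of biases are simplifications for illustration, not the exact baseline implementation:

```python
import numpy as np

def fc_autoencode(verts, W_enc, W_dec):
    """Minimal sketch of the FC ablation (FCA): one fully-connected encoder
    layer with a ReLU, and one fully-connected decoder layer back to the
    output space. Biases are omitted for brevity.

    verts: (V, 3) input mesh vertices, flattened to a single vector.
    W_enc: (3V, d) encoder weights mapping to a d-dimensional latent code.
    W_dec: (d, 3V) decoder weights mapping back to vertex positions.
    Returns the reconstructed (V, 3) vertices and the latent code.
    """
    x = verts.reshape(-1)            # flatten (V, 3) -> (3V,)
    z = np.maximum(x @ W_enc, 0.0)   # latent code with ReLU
    out = z @ W_dec                  # linear decode to vertex positions
    return out.reshape(verts.shape), z
```

In the FCED variant, W_dec would instead map to the 6N embedded-graph parameters, with the EDL producing the full-resolution mesh.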

4.2 Evaluation Settings

GL 2.6 8.9 2.4 2.4
Ours 2.2 2.3 2.3 2.4
Table 2: Evaluation of different settings of our network on the test set of DFaust [4] using the latent code of length 128. The numbers are the average vertex errors in .

We first determine the most favorable input type and loss function for the considered architectures. As input, we consider either the full mesh (IM) or the subset of vertices that is used to define the embedded deformation graph (IG). In addition to our proposed loss function, we consider the graph loss (GL), with the reconstruction loss directly on the graph node positions (where the vertex positions of the input mesh that correspond to the graph nodes are used as ground truth). The GL setting uses the EDL only at test time to map to the full mesh, but not for training. We perform this evaluation for both fully-connected (FC) and graph-convolutional (GC) architectures and train each variant for the same number of epochs.

Table 2 shows the quantitative results using the average per-vertex Euclidean error. Using the EDL during training leads to better quantitative results, as the network is aware of the skinning function and can move the graph nodes accordingly. Fully-connected networks perform worse with the mesh as input, perhaps due to the large number of parameters in the network. Graph convolutions by design perform local computations and thus require a much smaller number of free variables during training. In all further experiments, we use graphs as inputs for the fully-connected architectures and meshes as inputs for the convolutional architectures, to choose the strongest baselines. We always use the EDL during training in all further results.

4.3 Evaluations of the Autoencoder

Qualitative Evaluations. Our architecture significantly outperforms the baselines qualitatively on the DFaust and SynHand5M datasets, as seen in Figs. 4 and 5. Convolutional architectures without an embedded graph produce strong artifacts in the hand, feet and face regions in the presence of large deformations. Since EDL explicitly models deformations, we preserve fine details under strong non-linear deformations and articulations of extremities.
Quantitative Evaluations. We compare the proposed DEMEA to the baselines on the autoencoding task, see Table 3.

Figure 4: In contrast to graph-convolutional networks that directly regress vertex positions, our embedded graph layer does not show artifacts. These results use a latent dimension of .
       DFaust [4]        SynHand5M [24]      Cloth [3]          CoMA [29]
       8    32    128    8      32     128   8     32    128    8     32    128
CA     6.7  3.0   2.6    10.30  4.49   3.76  1.61  0.90  0.72   1.57  0.97  0.87
MCA    8.6  3.4   2.5    9.33   4.55   3.67  1.75  0.82  0.70   1.61  0.99  0.87
Ours   6.6  2.9   2.4    8.97   4.67   3.53  1.34  0.83  0.71   1.49  1.05  0.94
FCA    9.3  3.4   2.2    20.96  7.22   1.44  1.71  0.71  0.44   3.19  1.39  0.75
FCED   8.2  3.1   2.2    20.49  9.13   1.60  1.89  0.69  0.44   3.61  3.19  1.08
Table 3: Average per-vertex errors on the test sets of DFaust (in ), SynHand5M (in ), textureless cloth (in ) and CoMA (in ), for latent dimensions 8, 32, and 128.

While the fully-connected baselines are competitive for larger dimensions of the latent space, their memory demand increases drastically. On the other hand, they perform significantly worse for low dimensions on all datasets. In this work, we are interested in low latent dimensions, e.g. less than 32, as we want to learn mesh representations that are as compact as possible. In the experiments, we predominantly evaluate with latent dimensions and .

For all datasets, DEMEA outperforms all considered baselines for latent codes of length . For a latent dimension of , the gap in the accuracy shrinks. With an increasing dimensionality of the latent code, competing architectures obtain comparable or better results quantitatively, perhaps because they become capable of fitting to the high-frequency details and noise. On the other hand, we are interested in capturing smooth large non-rigid deformations. The baselines achieve comparable accuracy with different sizes of the latent space for different datasets. In the case of the face dataset [29], the closest baselines are on par even for latent codes of size . Since faces deform locally, is sufficient to capture the deformations around the mean shape using standard architectures. As the latent code becomes larger, fully-connected networks consistently outperform all convolutional architectures, as discussed before.
Comparisons. In extensive comparisons with several competitive baselines, we have demonstrated the usefulness of our approach for autoencoding strong non-linear deformations and articulated motion. Next, we compare DEMEA to the existing state-of-the-art CoMA approach [29]. We train their architecture on all mentioned datasets with a latent dimension of , which is also used in [29]. We outperform their method quantitatively on DFaust ( vs. ), on SynHand5M ( vs. ), and on Cloth ( vs. ). We perform worse on Faces ( vs. ), where the deformations are not large. On the other datasets, the advantage of our explicit EDL formulation is clearly noticeable qualitatively. In Fig. 4, we show that DEMEA avoids many of the artifacts present in the case of [29] and other baselines.

Figure 5: Auto-encoding results on all four datasets. Clockwise, starting from the single image on the left: Ground truth, CA with latent dimension 8, Ours with 8, Ours with 32, CA with 32. Best viewed on a screen.

5 Applications

We show several applications of DEMEA, including image-to-mesh reconstruction and deformation transfer.

5.1 RGB to Mesh

On the Cloth [3] dataset, we show that DEMEA can reconstruct meshes from RGB images. See Fig. 6 for qualitative examples using a latent dimension of .

Figure 6: RGB-to-mesh results on our test set. From left to right: real RGB image, our reconstruction, ground truth.

On our test set, our proposed architecture achieves RGB-to-mesh reconstruction errors of , and for latent dimensions , and , respectively. Bednařík et al. [3], who use a different split than us, report an error of . Moreover, we asked the authors of the Hybrid Deformation Model Network (HDM-Net) [12] to train their method for the regression of textureless surfaces. On their split, HDM-Net achieves an error of after training for 100 epochs using a batch size of 4. Under the same settings, we re-train our approach without pre-training the mesh decoder. Our approach obtains test errors of , and using latent dimensions of , and , respectively.

5.2 Depth to Mesh

For hands and bodies, we demonstrate reconstruction results from single depth images.
Bodies. We train networks with a small latent space dimension of and a larger dimension of . Quantitatively, we obtain errors of and with latent space dimensions of and , respectively, on un-augmented synthetic data. Besides, we also apply our approach to real data, see Fig. 7.

Figure 7: DEMEA on real Kinect depth images. From left to right: depth, our reconstructions with latent dimensions and .

To this end, we found it necessary to augment the depth images with artificial noise to lessen the domain gap. Video results are included in the supplementary.
Hands. DEMEA can reconstruct hands from depth as well, see Fig. 8.

Figure 8: Reconstruction results from synthetic depth images of hands using a latent dimension of . From left to right: depth, our reconstruction, ground truth.

We achieve a reconstruction error of for a latent dimension of and for . Malik et al. [24] report an error of . Our test set is composed of a random sample of fully randomly generated hands from the dataset, which is very challenging. We use , whereas [24] use images of size .

5.3 Latent Space Arithmetic

Although we do not employ any regularization on the latent space, we found empirically that the network learns a well-behaved latent space. As we show in the supplemental document and video, this allows DEMEA to temporally smooth tracked meshes from a depth stream.
Latent Interpolation. We can linearly interpolate the latent vectors and of a source and a target mesh: . Even for highly different poses and identities, decoding these interpolated latent vectors yields plausible in-between meshes, see Fig. 9.
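Latent interpolation amounts to a convex combination of the two codes, which can be sketched as (names are ours):

```python
import numpy as np

def interpolate_latent(z_src, z_tgt, alphas):
    """Linearly interpolate between a source and a target latent code.

    z_src, z_tgt: latent vectors of the source and target meshes.
    alphas: iterable of interpolation factors in [0, 1].
    Decoding each returned code with the mesh decoder yields the
    in-between meshes shown in Fig. 9.
    """
    return [(1.0 - a) * z_src + a * z_tgt for a in alphas]
```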

Figure 9: Interpolation results, from left to right: source mesh, , , , , target mesh.
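In code, this interpolation amounts to a convex combination of the two codes; a minimal sketch (names are ours), where each returned code would then be passed through the mesh decoder:

```python
import numpy as np

def interpolate_latents(l_src, l_tgt, num_inner=4):
    """Convex combinations l(a) = (1 - a) * l_src + a * l_tgt.

    Returns num_inner + 2 codes, including both endpoints.
    """
    alphas = np.linspace(0.0, 1.0, num_inner + 2)
    return [(1.0 - a) * l_src + a * l_tgt for a in alphas]

codes = interpolate_latents(np.zeros(8), np.ones(8))
# Decoding each code yields the in-between meshes.
```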

Deformation Transfer. Furthermore, the learned latent space even allows transferring poses between different identities on DFaust. Let a sequence (s_i)_i of source meshes of person A and a target mesh t_0 of person B be given, where w.l.o.g. s_0 and t_0 correspond to the same pose. We now seek a sequence (t_i)_i of target meshes of person B performing the same poses as person A in (s_i)_i. We encode s_i and t_0 using the mesh encoder, yielding the corresponding latent vectors l(s_i) and l(t_0). We define the identity difference d = l(t_0) − l(s_0) and set l(t_i) = l(s_i) + d for i > 0. Decoding l(t_i) with the mesh decoder then yields t_i. We show qualitative results in Fig. 10 and in the supplementary material.

Figure 10: Deformation transfer from a source sequence to a target identity. The first column shows the source and target meshes in their shared reference pose.
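The transfer described above can be sketched as follows, with encode/decode standing in for the trained mesh encoder and decoder (all names are ours):

```python
import numpy as np

def transfer_poses(encode, decode, source_meshes, target_mesh):
    """Re-pose the target identity using the source sequence.

    source_meshes[0] and target_mesh are assumed to share the same pose.
    """
    source_codes = [encode(m) for m in source_meshes]
    # Identity difference between target and source in latent space.
    d = encode(target_mesh) - source_codes[0]
    return [decode(code + d) for code in source_codes]

# Sanity check with identity encoder/decoder: the constant identity
# offset is carried through the whole sequence.
enc = dec = lambda x: x
src = [np.full(3, float(i)) for i in range(3)]
out = transfer_poses(enc, dec, src, src[0] + 5.0)
```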

6 Limitations

While the embedded deformation graph excels on highly articulated, non-rigid motions, it has difficulties accounting for very subtle actions. Since the faces in the CoMA [29] dataset do not undergo large deformations, our EDL-based architecture does not offer a significant advantage there. Like all other 3D deep learning techniques, our approach requires reasonably sized mesh datasets for supervised training, which might be difficult to capture or model. We train our network in an object-specific manner; generalizing it across different object categories is an interesting direction for future work.

7 Conclusion

We proposed DEMEA, the first deep mesh autoencoder for highly deformable and articulated scenes, such as human bodies, hands, and deformable surfaces, that builds on a new differentiable embedded deformation layer. The deformation layer reasons about the local rigidity of the mesh and allows us to achieve higher-quality autoencoding results compared to several baselines and existing approaches. We have shown multiple applications of our architecture, including non-rigid reconstruction from real depth maps and 3D reconstruction of textureless surfaces from images.

Acknowledgments. This work was supported by the ERC Consolidator Grant 4DReply (770784), the Max Planck Center for Visual Computing and Communications (MPC-VCC), and an Oculus research grant.


  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  • [2] T. Bagautdinov, C. Wu, J. Saragih, Y. Sheikh, and P. Fua. Modeling facial geometry using compositional VAEs. In Computer Vision and Pattern Recognition (CVPR), 2018.
  • [3] J. Bednařík, P. Fua, and M. Salzmann. Learning to reconstruct texture-less deformable surfaces. In International Conference on 3D Vision (3DV), 2018.
  • [4] F. Bogo, J. Romero, G. Pons-Moll, and M. J. Black. Dynamic FAUST: Registering human bodies in motion. In Computer Vision and Pattern Recognition (CVPR), 2017.
  • [5] D. Boscaini, J. Masci, E. Rodolà, and M. Bronstein. Learning shape correspondence with anisotropic convolutional neural networks. In International Conference on Neural Information Processing Systems (NIPS), pages 3197–3205, 2016.
  • [6] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. CoRR, abs/1312.6203, 2013.
  • [7] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In International Conference on Neural Information Processing Systems (NIPS), NIPS’16, pages 3844–3852, 2016.
  • [8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
  • [9] D. Fuentes-Jimenez, D. Casillas-Perez, D. Pizarro, T. Collins, and A. Bartoli. Deep Shape-from-Template: Wide-Baseline, Dense and Fast Registration and Deformable Reconstruction from a Single Image. arXiv e-prints, 2018.
  • [10] L. Gao, J. Yang, Y.-L. Qiao, Y.-K. Lai, P. L. Rosin, W. Xu, and S. Xia. Automatic unpaired shape deformation transfer. ACM Trans. Graph., 37(6):237:1–237:15, 2018.
  • [11] M. Garland and P. S. Heckbert. Surface simplification using quadric error metrics. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’97, pages 209–216, New York, NY, USA, 1997. ACM Press/Addison-Wesley Publishing Co.
  • [12] V. Golyanik, S. Shimada, K. Varanasi, and D. Stricker. HDM-Net: Monocular non-rigid 3D reconstruction with learned deformation model. In International Conference on Virtual Reality and Augmented Reality (EuroVR), pages 51–72, 2018.
  • [13] T. Groueix, M. Fisher, V. G. Kim, B. Russell, and M. Aubry. AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation. In Computer Vision and Pattern Recognition (CVPR), 2018.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778, 2016.
  • [15] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
  • [16] D. Jack, J. K. Pontes, S. Sridharan, C. Fookes, S. Shirazi, F. Maire, and A. Eriksson. Learning free-form deformations for 3d object reconstruction. In Asian Conference on Computer Vision (ACCV), 2018.
  • [17] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik. Learning category-specific mesh reconstruction from image collections. In European Conference on Computer Vision (ECCV), 2018.
  • [18] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • [19] A. Kurenkov, J. Ji, A. Garg, V. Mehta, J. Gwak, C. Choy, and S. Savarese. Deformnet: Free-form deformation network for 3d shape reconstruction from a single image. In Winter Conference on Applications of Computer Vision, 2018.
  • [20] H. Li, B. Adams, L. J. Guibas, and M. Pauly. Robust single-view geometry and motion reconstruction. In ACM SIGGRAPH Asia, pages 175:1–175:10, 2009.
  • [21] O. Litany, A. Bronstein, M. Bronstein, and A. Makadia. Deformable shape completion with graph convolutional autoencoders. In Computer Vision and Pattern Recognition (CVPR), 2018.
  • [22] M. Loper, N. Mahmood, and M. J. Black. Mosh: Motion and shape capture from sparse markers. ACM Trans. Graph., 33(6):220:1–220:13, 2014.
  • [23] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. Smpl: A skinned multi-person linear model. ACM Trans. Graph., 34(6):248:1–248:16, 2015.
  • [24] J. Malik, A. Elhayek, F. Nunnari, K. Varanasi, K. Tamaddon, A. Héloir, and D. Stricker. DeepHPS: End-to-end estimation of 3D hand pose and shape by learning from synthetic depth. In International Conference on 3D Vision (3DV), pages 110–119, 2018.
  • [25] J. Masci, D. Boscaini, M. M. Bronstein, and P. Vandergheynst. Geodesic convolutional neural networks on riemannian manifolds. In International Conference on Computer Vision Workshop (ICCVW), pages 832–840, 2015.
  • [26] F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In Computer Vision and Pattern Recognition (CVPR), pages 5425–5434, 2017.
  • [27] M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. In International Conference on Machine Learning (ICML), volume 48, pages 2014–2023, 2016.
  • [28] A. Pumarola, A. Agudo, L. Porzi, A. Sanfeliu, V. Lepetit, and F. Moreno-Noguer. Geometry-aware network for non-rigid shape prediction from a single view. In Computer Vision and Pattern Recognition (CVPR), 2018.
  • [29] A. Ranjan, T. Bolkart, S. Sanyal, and M. J. Black. Generating 3D faces using convolutional mesh autoencoders. In European Conference on Computer Vision (ECCV), pages 725–741, 2018.
  • [30] A. Sinha, A. Unmesh, Q. Huang, and K. Ramani. Surfnet: Generating 3d shape surfaces using deep residual networks. In Computer Vision and Pattern Recognition (CVPR), 2017.
  • [31] O. Sorkine and M. Alexa. As-rigid-as-possible surface modeling. In Eurographics Symposium on Geometry Processing (SGP), pages 109–116, 2007.
  • [32] R. W. Sumner, J. Schmid, and M. Pauly. Embedded deformation for shape manipulation. In ACM SIGGRAPH, 2007.
  • [33] Q. Tan, L. Gao, Y.-K. Lai, and S. Xia. Variational autoencoders for deforming 3d mesh models. In Computer Vision and Pattern Recognition (CVPR), 2018.
  • [34] Q. Tan, L. Gao, Y.-K. Lai, J. Yang, and S. Xia. Mesh-based autoencoders for localized deformation component analysis. In AAAI, 2018.
  • [35] N. Verma, E. Boyer, and J. Verbeek. FeaStNet: Feature-Steered Graph Convolutions for 3D Shape Analysis. In Computer Vision and Pattern Recognition (CVPR), pages 2598–2606, 2018.
  • [36] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. In European Conference on Computer Vision (ECCV), 2018.
  • [37] L. Yi, H. Su, X. Guo, and L. Guibas. Syncspeccnn: Synchronized spectral cnn for 3d shape segmentation. In Computer Vision and Pattern Recognition (CVPR), pages 6584–6592, 2017.
  • [38] M. Zollhöfer, M. Nießner, S. Izadi, C. Rhemann, C. Zach, M. Fisher, C. Wu, A. Fitzgibbon, C. Loop, C. Theobalt, and M. Stamminger. Real-time non-rigid reconstruction using an rgb-d camera. ACM Transactions on Graphics (TOG), 33(4), 2014.