Neural 3D Morphable Models: Spiral Convolutional Networks for 3D Shape Representation Learning and Generation

05/08/2019 · Giorgos Bouritsas et al. · Imperial College London, USI Università della Svizzera italiana

Generative models for 3D geometric data arise in many important applications in 3D computer vision and graphics. In this paper, we focus on 3D deformable shapes that share a common topological structure, such as human faces and bodies. Morphable Models were among the first attempts to create compact representations for such shapes; despite their effectiveness and simplicity, such models have limited representation power due to their linear formulation. Recently, non-linear learnable methods have been proposed, although most of them resort to intermediate representations, such as 3D grids of voxels or 2D views. In this paper, we introduce a convolutional mesh autoencoder and a GAN architecture based on the spiral convolutional operator, acting directly on the mesh and leveraging its underlying geometric structure. We provide an analysis of our convolution operator and demonstrate state-of-the-art results on 3D shape datasets compared to the linear Morphable Model and the recently proposed COMA model.

1 Introduction

The success of deep learning in computer vision, image analysis, speech recognition, and natural language processing has driven recent interest in developing similar models for 3D geometric data. The generalization of successful architectures such as convolutional neural networks (CNNs) to data with non-Euclidean structure (e.g., manifolds and graphs) is known under the umbrella term geometric deep learning [8]. In applications dealing with 3D data, the key challenge of geometric deep learning is a meaningful definition of intrinsic operations analogous to convolution and pooling on meshes or point clouds. Among the numerous advantages of working directly on mesh or point cloud data is the fact that invariance to shape transformations (both rigid and non-rigid) can be built into the architecture, allowing the use of significantly simpler models and much less training data. So far, the main focus of research in geometric deep learning has been on analysis tasks, encompassing shape classification and segmentation [32, 34], local descriptor learning, correspondence, and retrieval [30, 7, 27]; most of these problems are posed as supervised learning.

On the other hand, there has been limited progress in geometric representation learning and the generation of geometric data (shape synthesis). Obtaining descriptive and compact representations of meshes and point clouds is crucial for efficient compression as well as for downstream tasks such as classification and 3D reconstruction. The ability to generate geometric data is pivotal in applications such as 3D printing, computer graphics and animation, virtual reality, and game design, and can heavily assist graphics designers and speed up production. In addition, given the high cost and time of acquiring quality 3D data, geometric generative models can be used as a cheap alternative for producing training data for geometric ML algorithms.

Most previous approaches to geometric data generation rely on intermediate representations of 3D shapes, such as point clouds [1], voxels [41], UV maps, or differentiable rendering [38], rather than directly on surface representations such as meshes. Despite the success of such techniques in certain cases, they suffer from either high computational complexity (voxels) or the absence of smoothness in the resulting data (point clouds). Litany et al. [24] introduced an intrinsic mesh autoencoder architecture for shapes with fixed topology and demonstrated high-quality shape completion results. In [35], a similar approach was used to construct a non-linear facial morphable model. This method was built upon spectral convolutional operators originally developed for graphs [12], whose local permutation invariance limits the filter representation power (spectral filters are isotropic).

In this paper, we propose a novel representation learning and generative framework for fixed topology 3D shapes. For this purpose, we formulate a spiral convolution operator that acts directly on the mesh, extending the methodology proposed in [23]. In particular, similarly to image convolutions, we construct an intrinsic operator that defines a local neighborhood around each vertex on the mesh, by enforcing an explicit ordering of the neighbors via a spiral scan. This way, we allow for different treatment of each neighbor, yielding anisotropic filters. We motivate the generality of the operator for domains where there exists an implicit ordering, by showing that it can be applied on arbitrary mesh topologies and point clouds, and by demonstrating its equivalence to traditional grid convolutions. With this approach, operations such as dilated convolutions can be easily formulated on meshes.

We use spiral convolution as a basic building block for hierarchical intrinsic mesh autoencoders and GAN architectures. We use our intrinsic autoencoder to build a nonlinear Neural 3D Morphable Model (Neural3DMM) capable of learning representations and generating 3D shapes from various categories. We quantitatively evaluate our methods on several popular datasets: human faces with different expressions (COMA [35]) and identities (MeIn3D [5]) and human bodies with pose variation (DFAUST [4]). Our model achieves state-of-the-art reconstruction results, outperforming the widely used linear 3D Morphable Model [3] and the COMA autoencoder [35]. We also qualitatively assess our framework, showing 'shape arithmetics' in the latent space of the autoencoder. Finally, we show novel, realistic, unseen facial identities generated by our mesh GAN.

2 Related Work

Generative models for arbitrary shapes. Perhaps the most common approaches to generating arbitrary shapes are volumetric CNNs [42, 33, 28] acting on 3D voxels. In [18], the authors propose to jointly learn to autoencode 3D voxel occupancy grids and regress to them from images. Sharma et al. [37] introduced a voxel denoising autoencoder, and Wu et al. [41] proposed a voxel-GAN to generate novel shapes. Among the key drawbacks of volumetric methods are their inherently high computational complexity and a coarse representation that can alter the topological structure of the shapes. Point clouds are a simple and lightweight alternative to volumetric representations that has recently been gaining popularity. Several methods have been proposed for representation learning of fixed-size point clouds [1] using the PointNet [32] architecture. In [43], point clouds of arbitrary size were addressed, and in [15] and [26], point cloud reconstruction from single images was performed. Despite their compactness, point clouds are not popular for realistic, high-quality 3D geometry generation, due to the absence of neighborhood connectivity and the lack of an underlying smooth structure.

Morphable models. In the case of deformable shapes, such as faces and bodies, where a fixed topology can be obtained by establishing dense correspondences, the most popular methods are still statistical models, given their simplicity and, in most cases, the validity of the assumptions they make. For faces, the baseline is the PCA-based 3D Morphable Model (3DMM) [3]. Regarding facial identity, the Large Scale Face Model (LSFM) [5] was proposed and made publicly available, while a large-scale model of the entire head was proposed in [31]. Regarding facial expression, similar methods have been presented in [10, 22]. For bodies and hands, the most well-known body model is SMPL [25], a skinned vertex-based model parametrized by identity-dependent blend shapes (learned through PCA), blend weights (manually defined and then fine-tuned), and pose-dependent blend shapes. A similar model for hands (MANO) has also been recently proposed [36]. SMPL and MANO are non-linear and require (a) the localization of the joints and (b) solving special optimisation problems in order to project a new shape onto the space of the model. In this paper, we take a different approach, introducing a new family of fully differentiable Morphable Models, dubbed Neural3DMM, which can be applied to a variety of objects with strong (e.g., body) and less strong (e.g., face) articulations. Our method has greater representation power, being both non-linear and hierarchical, and requires no additional supervision.

Geometric deep learning is a set of recent methods attempting to generalize neural network architectures to non-Euclidean domains such as graphs and manifolds [8]. Such methods have achieved very promising results in geometry processing and computer graphics [27, 7, 30], computational chemistry and drug design [14, 17], and network science [20, 30]. Multiple approaches have been proposed to construct convolution-like operations on graphs and meshes, including spectral methods [9, 13, 20, 44], local charting-based methods [27, 7, 30, 16, 40, 21, 23], and graph attention [39]. Finally, in order to aggregate local information, graph or mesh coarsening techniques [13, 45] have been proposed, equivalent to image pooling.

3 Spiral Convolutional Networks

3.1 Spiral Convolution

For the following discussion, we assume to be given a manifold (surface), discretized as a triangular mesh $\mathcal{M} = (\mathcal{V}, \mathcal{E}, \mathcal{F})$, where $\mathcal{V}$, $\mathcal{E}$, and $\mathcal{F}$ denote the sets of vertices, edges, and faces, respectively. We furthermore assume to be given a function $f : \mathcal{V} \rightarrow \mathbb{R}$ representing the vertex-wise features.

One of the key challenges in developing convolution-like operators on graphs or manifolds is the lack of a vector space structure and a global system of coordinates that can be associated with each point. The first intrinsic mesh convolutional architectures such as GCNN [27], ACNN [7], or MoNet [30] overcame this problem by constructing a local system of coordinates $\mathbf{u}(x, y)$ around each vertex $x$ of the mesh, in which a set of local weighting functions $w_1, \ldots, w_L$ is applied to aggregate information from the neighbor vertices $y \in \mathcal{N}(x)$. This allows the definition of 'patch operators' generalizing the sliding window filtering in images,

$$(f \star g)(x) = \sum_{\ell=1}^{L} g_\ell \sum_{y \in \mathcal{N}(x)} w_\ell(\mathbf{u}(x, y))\, f(y), \qquad (1)$$

where the $w_\ell(\mathbf{u}(x, y))$ are 'soft pixels' ($L$ in total), $f(y)$ are akin to pixel intensities in images, and $g_\ell$ are the filter weights. In the Euclidean setting, such operators boil down to the classical convolution, since the local systems of coordinates can be identified with the global coordinate system.

One difficulty in the construction of patch operators is their consistency, i.e., insensitivity to the meshing: ideally, a local system of coordinates constructed on a manifold should be independent of the way the manifold is discretized. A second difficulty is the lack of a canonical reference frame; in particular, a patch on a surface can be oriented arbitrarily. Several mechanisms have been proposed to deal with the latter issue, including angular max-pooling [27].

A crucial observation we make in this paper is that these issues are irrelevant when generating shapes with fixed topology. Assuming that the mesh structure is fixed, constructing a patch operator amounts to an ordering of the neighbor points,

$$(f \star g)(x) = \sum_{\ell=1}^{L} g_\ell\, f(x_\ell), \qquad (2)$$

where $x_1, \ldots, x_L$ denote the neighbors of vertex $x$ ordered in some fixed way. In the Euclidean setting, this order is simply a raster scan of the pixels in a patch. On meshes, we opt for a simple and intuitive ordering using spiral trajectories, inspired by [23].

Let $x \in \mathcal{V}$ be some mesh vertex, and let $R^d(x)$ be the $d$-ring, an ordered set of vertices whose shortest (graph) path to $x$ is exactly $d$ hops long; $R^d_j(x)$ denotes the $j$th element in the $d$-ring (trivially, $R^0_1(x) = x$). We define the spiral patch operator as the ordered sequence

$$S(x) = \{x,\; R^1_1(x),\; R^1_2(x),\; \ldots,\; R^h_{|R^h(x)|}(x)\}, \qquad (3)$$

where $h$ denotes the patch radius (number of hops), acting as a hyperparameter similar to the kernel size in classical CNNs.

The uniqueness of the ordering is given by fixing two degrees of freedom: the direction of the rings (whether they are ordered clockwise or counterclockwise) and the first vertex $R^1_1(x)$. These degrees of freedom are set by selecting the first two vertices of the spiral; the rest are ordered inductively. In order to allow for fixed-size spirals, we truncate each spiral to a fixed length $L$ and zero-pad the spirals that are shorter than the chosen length. We define spiral convolution as

$$(f \star g)(x) = \sum_{\ell=1}^{L} g_\ell\, f\big(S_\ell(x)\big). \qquad (4)$$

The application of the spiral ordering is illustrated in Fig 2.

Figure 2: Spiral ordering on a mesh and an image patch
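To make the operator concrete, here is a minimal sketch of a spiral ordering and of Eq. (4) in PyTorch. It is illustrative only: `build_spiral` and `SpiralConv` are hypothetical names, the breadth-first collection relies on the neighbor lists already being ordered consistently (ring direction and first vertex fixed as described above), and the authors' released implementation may differ.

```python
import torch
import torch.nn as nn

def build_spiral(adj, center, length):
    """Collect `center`, then its 1-ring, 2-ring, ... in breadth-first order.
    `adj` maps each vertex index to a consistently ordered neighbor list;
    the ring direction and each ring's first vertex (see text) are assumed
    to be baked into that ordering. Pads short spirals with -1."""
    spiral, seen = [center], {center}
    ring = [center]
    while ring and len(spiral) < length:
        nxt = []
        for v in ring:
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    nxt.append(u)
        spiral.extend(nxt)
        ring = nxt
    spiral = spiral[:length]
    return spiral + [-1] * (length - len(spiral))  # zero-padding slots

class SpiralConv(nn.Module):
    """Eq. (4): one weight vector per spiral position -> anisotropic filter."""
    def __init__(self, in_ch, out_ch, length):
        super().__init__()
        self.fc = nn.Linear(in_ch * length, out_ch)

    def forward(self, x, spirals):
        # x: (N, in_ch) vertex features; spirals: (N, length) long tensor.
        pad = torch.cat([x, x.new_zeros(1, x.size(1))], dim=0)  # index -1 -> zeros
        patch = pad[spirals].reshape(spirals.size(0), -1)       # (N, length*in_ch)
        return self.fc(patch)
```

Because every spiral is truncated or padded to the same length $L$, the gather-and-reshape step turns the whole layer into a single dense matrix multiplication, which is what makes spiral convolution comparable in cost to an image convolution.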

Comparison to Lim et al. [23]: The authors choose the starting point of each spiral at random, for every mesh sample, every vertex, and every epoch during training. This choice prevents the operator from being equivalent to image convolutions and cancels out the desirable properties of the ordering. In particular, for the ordering to be consistent, it needs to be defined on a local coordinate system that is repeatable across meshes (mesh consistency) and grounded in the underlying geometry of the shape (vertex consistency). If one of the two properties is missing, then during training the parameters may struggle to adapt to the large variability of the coordinate systems, and the representational power diminishes. In particular, the choice of a random starting point per epoch increases the possibility of learning rotation-invariant filters, where the parameters acting on vertices in the same ring become approximately equal.

To overcome this, we emphasize the importance of the choice of the starting point. In particular, for shapes with arbitrary topology, a repeatable local reference should be chosen, so that mesh patches with similar structure are fed to the convolutional operator with the same ordering. For fixed topologies, this property is easily obtained, since we can retrieve the same ordering by choosing the same vertex index for every mesh. To make our method more robust, we choose the starting point based on the underlying geometry of the shape: we fix a reference vertex $v_0$ on the mesh and choose the initial point of each spiral to be in the direction of the shortest geodesic path to $v_0$,

$$R^1_1(x) = \operatorname*{argmin}_{y \in R^1(x)} d_{\mathcal{M}}(v_0, y), \qquad (5)$$

where $d_{\mathcal{M}}$ is the geodesic distance between two vertices on the mesh $\mathcal{M}$. In this way, we achieve vertex consistency and make our method generalizable.
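A sketch of the starting-point selection of Eq. (5). Two hedges apply: `spiral_starts` is a hypothetical helper, and the graph shortest-path distance computed by SciPy's `dijkstra` stands in for the true geodesic distance $d_{\mathcal{M}}$.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def spiral_starts(edges, num_vertices, v0):
    """For every center x, pick the 1-ring vertex closest to the reference
    vertex v0 (Eq. (5)); `edges` is an iterable of (i, j) vertex pairs."""
    rows, cols = zip(*edges)
    adj = csr_matrix((np.ones(len(rows)), (rows, cols)),
                     shape=(num_vertices, num_vertices))
    adj = ((adj + adj.T) > 0).astype(float)     # symmetric, 0/1 weights
    dist_to_v0 = dijkstra(adj, indices=v0)      # graph distance of every vertex to v0
    neighbors = adj.tolil().rows                # 1-ring of every vertex
    return {x: min(neighbors[x], key=lambda y: dist_to_v0[y])
            for x in range(num_vertices)}
```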

Moreover, in [23], the authors model the vertices on the spiral via a recurrent network which, besides its higher computational complexity, does not take advantage of the stationary properties of the 3D shape (local statistics are repeated across different patches), which our spiral kernel treats efficiently through weight sharing.

Comparison to spectral filters: Spectral convolutional operators, developed in [12, 20] for graphs and used in [35] for mesh autoencoders, suffer from the fact that they are inherently isotropic. The reason is that on a general graph there is no canonical ordering of the neighbor vertices, so one has to resort to permutation-invariant operators. Spectral filters rely on the Laplacian operator, which performs a weighted averaging of the neighbor vertices,

$$(\Delta f)(x) = \sum_{y \in \mathcal{N}(x)} w_{xy}\,\big(f(x) - f(y)\big), \qquad (6)$$

where $w_{xy}$ denotes a weight computed from the edge length. A polynomial of degree $r$ with learnable coefficients $\theta_0, \ldots, \theta_r$ is then applied to $\Delta$,

$$g_\theta(\Delta) = \theta_0 + \theta_1 \Delta + \ldots + \theta_r \Delta^r, \qquad (7)$$

amounting to filtering the Laplacian eigenvalues, $g_\theta(\Delta) = \Phi\, g_\theta(\Lambda)\, \Phi^\top$.

While a necessary evil on general graphs, spectral filters on meshes are rather weak: they are locally rotationally invariant, i.e., they have a fixed value in each ring. Spiral convolutional filters, on the other hand, leverage the fact that on a mesh one can canonically order the neighbors (up to the selection of the orientation and reference direction); consequently, our filters are inherently anisotropic and can be expressive with just one hop ($h = 1$).
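The isotropy of Eqs. (6)-(7) can be checked directly: a polynomial of the Laplacian only ever sees ring sums, so permuting a vertex's neighbors leaves the output at that vertex unchanged. A small NumPy sketch (dense matrices for readability; practical implementations use sparse operations):

```python
import numpy as np

def graph_laplacian(W):
    """Eq. (6) in matrix form: L = D - W for a symmetric weight matrix W."""
    return np.diag(W.sum(axis=1)) - W

def poly_filter(W, f, theta):
    """Eq. (7): g_theta(L) f = sum_k theta_k L^k f."""
    L = graph_laplacian(W)
    out, Lk_f = np.zeros_like(f), f.copy()
    for th in theta:
        out += th * Lk_f
        Lk_f = L @ Lk_f
    return out

# Isotropy check on a toy star graph (center 0, leaves 1..4):
W = np.zeros((5, 5))
W[0, 1:] = W[1:, 0] = 1.0
f = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
perm = [0, 3, 4, 1, 2]                        # permute the leaves only
g1 = poly_filter(W, f, theta=[0.5, -0.2, 0.1])[0]
g2 = poly_filter(W[np.ix_(perm, perm)], f[perm], theta=[0.5, -0.2, 0.1])[0]
assert np.isclose(g1, g2)  # the center's output ignores neighbor order
```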

Figure 3: Illustration of the activations produced by ChebNet compared to Spiral convolutions

In Fig 3 we visualize the impulse response (centered on a vertex on the forehead) of a selected filter from COMA's Chebyshev-polynomial-based mesh convolutions (left) and from a spiral convolutional filter with $h = 1$ (right). Notice the isotropic diffusion and the larger receptive field on the left (the vertex features are chosen to be the per-vertex deformations from the mean shape).

Furthermore, the equivalence of spiral convolution to image convolutions allows the use of long-studied practices from the computer vision community. For example, small patches can be used, leading to few parameters and fast computation. Dilated convolutions [46] can also be adapted to the spiral operator by simply sub-sampling the spiral, as sketched below. Finally, we argue that our operator could be applied to other domains where an ordering of the primitives can be enforced, such as point clouds.
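Dilation then amounts to a strided slice of a longer spiral; a one-liner on top of the hypothetical `build_spiral` from the earlier sketch:

```python
def dilated_spiral(adj, center, length, dilation=2):
    """Keep every `dilation`-th vertex of a longer spiral: the receptive
    field grows while the number of filter weights stays the same."""
    return build_spiral(adj, center, length * dilation)[::dilation]
```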

3.2 Neural 3D Morphable Models

Let $\mathbf{X} \in \mathbb{R}^{m \times dN}$ be the matrix whose rows are the signals defined on a set of $m$ meshes in dense correspondence, sampled from a distribution $\mathcal{D}$, where $d$ is the dimensionality of the signal on the mesh (vertex position, texture, etc.) and $N$ the number of vertices. A linear 3D Morphable Model [3] represents arbitrary instances $\mathbf{y} \in \mathbb{R}^{dN}$ as a linear combination of the $k$ largest eigenvectors of the covariance matrix of $\mathbf{X}$:

$$\mathbf{y} = \bar{\mathbf{x}} + \sum_{i=1}^{k} \alpha_i \mathbf{U}_i, \qquad (8)$$

where $\bar{\mathbf{x}}$ is the mean shape, $\mathbf{U}_i$ is the $i$th principal component, and $\alpha_i$ the respective linear coefficient. Given its linear formulation, the representational power of the 3DMM is constrained by the span of the eigenvectors, while its parameters scale linearly with the number of eigencomponents used, leading to large parametrizations for meshes of high resolution.
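For reference, Eq. (8) and its fitting reduce to a few lines of NumPy; this is a sketch of the PCA baseline, not the exact pipeline of [3]:

```python
import numpy as np

def fit_3dmm(X, k):
    """X: (m, d*N) flattened training meshes. Returns mean and top-k PCs."""
    x_bar = X.mean(axis=0)
    # SVD of the centered data gives the principal components (rows of Vt).
    _, _, Vt = np.linalg.svd(X - x_bar, full_matrices=False)
    return x_bar, Vt[:k]                      # U: (k, d*N)

def synthesize(x_bar, U, alpha):
    """Eq. (8): mean shape plus a linear combination of components."""
    return x_bar + alpha @ U                  # alpha: (k,)
```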

In contrast, in this paper we use spiral convolutions as a building block for a fully differentiable non-linear Morphable Model that learns how to represent and synthesize 3D shapes by acting directly on the mesh. In essence, a Neural 3D Morphable Model is a deep convolutional autoencoder that learns hierarchical representations of a shape. In particular, the encoder projects the mesh to an unstructured latent representation, from which the decoder attempts to reconstruct the original input. An illustration of the architecture can be found in Fig 1.

Similar to traditional convolutional autoencoders, we make use of a series of convolutional layers with small receptive fields, followed by pooling and unpooling for the encoder and the decoder respectively, where a decimated or upsampled version of the mesh is obtained each time and the features of the existing vertices are either aggregated or extrapolated. We follow [35] for the calculation of the features of the vertices added after upsampling, interpolating by weighting the nearby vertices with barycentric coordinates. The network is trained by minimizing the $L_1$ norm between the input and the predicted output.

The hierarchical nature of our model allows for efficient learning at multiple scales. As a result, the network learns to compress the 3D data in a more efficient and semantically meaningful way, while keeping the parameter space at a considerably lower dimension by taking advantage of the stationarity of the shape through localized convolutions.
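Putting the pieces together, a skeletal (single-mesh, batch-free) encoder/decoder along these lines, reusing the hypothetical `SpiralConv` from Section 3.1; the sparse matrices `D[i]` and `U[i]` are assumed to be the precomputed downsampling and barycentric upsampling operators between consecutive mesh resolutions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Neural3DMMSketch(nn.Module):
    """Sketch: spiral convs with precomputed pooling, mirrored in the decoder.
    widths: channels per level, e.g. [3, 16, 16, 16, 32];
    spirals[i]: (N_i, L) spiral indices at resolution level i;
    D[i]: sparse (N_{i+1}, N_i); U[i]: sparse (N_i, N_{i+1})."""
    def __init__(self, widths, spirals, D, U, latent_dim):
        super().__init__()
        K = len(widths) - 1
        self.enc = nn.ModuleList(
            SpiralConv(widths[i], widths[i + 1], spirals[i].size(1))
            for i in range(K))
        self.dec = nn.ModuleList(
            SpiralConv(widths[i + 1], widths[i], spirals[i].size(1))
            for i in reversed(range(K)))
        self.spirals, self.D, self.U = spirals, D, U
        n_last = D[-1].size(0)                  # vertices at the coarsest level
        self.to_z = nn.Linear(n_last * widths[-1], latent_dim)
        self.from_z = nn.Linear(latent_dim, n_last * widths[-1])

    def forward(self, x):                       # x: (N_0, widths[0])
        for conv, d, s in zip(self.enc, self.D, self.spirals):
            x = torch.sparse.mm(d, F.elu(conv(x, s)))      # conv, then pool
        z = self.to_z(x.flatten())
        x = self.from_z(z).view(self.D[-1].size(0), -1)
        for conv, u, s in zip(self.dec, reversed(self.U), reversed(self.spirals)):
            x = conv(torch.sparse.mm(u, x), s)             # unpool, then conv
        return x
```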

3.3 Spiral Convolutional GAN

In order to improve the generation of high-resolution meshes with increased detail, we extend our framework with a distribution matching scheme. In particular, we propose a mesh Wasserstein GAN [2], trained to minimize the Wasserstein divergence between the real distribution of the meshes and the distribution of those produced by the generator network. Following the work of Gulrajani et al. [19], we enforce the Lipschitz constraint on the critic network that estimates the divergence by adding a gradient penalty to its loss function. The architectures of the generator and the discriminator have the same structure as the decoder and the encoder of the Neural3DMM, respectively. Notice that the fixed topology allows us to predefine the upsampling matrices for the unpooling layers of the generator. Via this framework, we obtain two additional properties that are inherently absent from an autoencoder architecture: high-frequency detail and a straightforward way to sample from the latent space. As a result, by sampling the generator's latent space we are able to produce realistic-looking examples, which in this case are novel facial identities unseen during training.
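For completeness, the critic objective with gradient penalty as in [19], in sketch form (`C` is the critic; `real` and `fake` are batches of mesh signals of shape (B, N, 3); the penalty weight 10 is the value suggested in [19]):

```python
import torch

def critic_loss(C, real, fake, lambda_gp=10.0):
    """WGAN-GP critic objective: Wasserstein estimate plus a gradient penalty
    that softly enforces the Lipschitz constraint along real-fake interpolants."""
    eps = torch.rand(real.size(0), 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(C(interp).sum(), interp, create_graph=True)[0]
    penalty = ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    return C(fake).mean() - C(real).mean() + lambda_gp * penalty
```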

4 Evaluation

In this section, we showcase the effectiveness of our proposed methods on datasets covering several shape categories. We demonstrate the inherently higher representation power of spiral convolutions compared to ChebNet using the same architecture. Additionally, we quantitatively show that our method yields better representations than the linear 3DMM and COMA, while maintaining a small total number of parameters and, frequently, a more compact latent representation, by comparing their average reconstruction errors. Moreover, we proceed with a qualitative evaluation, analysing the advantages of our method in reconstructing exemplar meshes and generating novel examples through vector space arithmetic. Finally, we assess our intrinsic GAN in terms of its ability to produce realistic high-resolution examples.

We compare the following methods:

  • PCA: 3D Morphable Model [3]

  • COMA: ChebNet-based Convolutional Mesh Autoencoder [35]. For the high-resolution meshes of the MeIn3D dataset, we modified the architecture with an additional convolutional and downsampling/upsampling layer.

  • Neural3DMM (small): Our spiral convolution autoencoder, using the same architecture as COMA with the spectral filter replaced by the spiral one.

  • Neural3DMM (ours): Our proposed Neural3DMM framework, where we enhanced our model with a larger parameter space (see Sec. 4.2).

In all cases, we choose as the signal on the mesh the normalized deformations from the mean shape: for every vertex, we subtract its mean position and divide by the standard deviation. In this way, we facilitate optimization while training the networks. Given that omitting the normalization by the standard deviation yields similar results for the 3DMM, we follow the same practice for it as well, for consistency and a fair comparison.
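The normalization step, written out (a sketch; shapes assumed to be (m, N, 3)):

```python
import numpy as np

def normalize(meshes):
    """Signal = per-vertex deformation from the mean shape, standardized."""
    mean = meshes.mean(axis=0)            # (N, 3) mean shape
    std = meshes.std(axis=0) + 1e-8       # guard against zero variance
    return (meshes - mean) / std, mean, std

def denormalize(signal, mean, std):
    """Map network outputs back to vertex coordinates."""
    return signal * std + mean
```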

4.1 Datasets

COMA. The facial expression dataset of Ranjan et al. [35], consisting of 20,000+ 3D scans in dense correspondence of twelve unique identities performing twelve types of extreme facial expressions. Each mesh has 5023 vertices. We used the same train/validation/test split as in [35]: 2050 meshes for the test set, 100 for the validation set, and 18,000+ for the training set.

DFAUST. The human body shape dataset of Bogo et al. [4], consisting of 40,000+ 3D scans (6890 vertices each) of ten unique identities performing various actions such as leg and arm raises, jumps, etc. A dense correspondence field is already established on the meshes used. We split the data at random into a test set of 5000, a validation set of 500, and a training set of 34,500+.

MeIn3D. The large-scale 3D facial identity dataset of Booth et al. [6], consisting of more than 10,000 distinct identity scans with 28K vertices each, covering a wide range of gender, ethnicity, and age. For the subsequent experiments, the MeIn3D dataset was randomly split within demographic constraints, to ensure gender, ethnic, and age diversity, into 9000 meshes for training and 1000 for testing.

Figure 4: Comparison of Spiral and Spectral filters with the exact same architecture
Figure 5: Quantitative evaluation of our Neural3DMM against the baselines, in terms of generalization and # of parameters

For the quantitative experiments of Sections 4.3 and 4.4, the evaluation metric used is generalization, which measures the ability of a model to represent novel shapes from the same distribution as it was trained on. More specifically, we evaluate the per-sample, per-vertex Euclidean distance in 3D space between corresponding vertices of the input and its reconstruction.
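Stated as code, the metric is simply the mean of per-vertex distances over the test set (a sketch; `model` stands for any of the compared autoencoders):

```python
import numpy as np

def generalization_error(model, test_meshes):
    """test_meshes: (m, N, 3); model maps an (N, 3) mesh to its (N, 3)
    reconstruction. Returns the mean per-vertex Euclidean distance."""
    errors = [np.linalg.norm(model(m) - m, axis=-1) for m in test_meshes]
    return float(np.mean(errors))
```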

Figure 6: Color coding of the per-vertex Euclidean error of the reconstructions produced by PCA (2nd row), COMA (3rd row), and our Neural3DMM (bottom row). Top row is ground truth.

4.2 Implementation Details

We denote by $SC(h, w)$ a spiral convolution of $h$ hops and $w$ filters, by $DS(p)$ and $US(p)$ a downsampling and an upsampling by a factor of $p$, respectively, by $FC(d)$ a fully connected layer of size $d$, and by $l$ the number of vertices after the last downsampling layer. The simple Neural3DMM for the COMA and DFAUST datasets is the following:

Enc: $SC(1,16) \rightarrow DS(4) \rightarrow SC(1,16) \rightarrow DS(4) \rightarrow SC(1,16) \rightarrow DS(4) \rightarrow SC(1,32) \rightarrow DS(4) \rightarrow FC(d)$
Dec: $FC(l \cdot 32) \rightarrow US(4) \rightarrow SC(1,32) \rightarrow US(4) \rightarrow SC(1,16) \rightarrow US(4) \rightarrow SC(1,16) \rightarrow US(4) \rightarrow SC(1,3)$

For MeIn3D, due to the high vertex count, we modified the COMA architecture for our simple Neural3DMM by adding an extra convolution with 8 filters and an extra downsampling/upsampling layer at the beginning and the end, respectively. The larger Neural3DMM follows the above architecture, but with an increased parameter space. For COMA, the convolutional filters of the encoder had sizes [64, 64, 64, 128], and for MeIn3D the sizes were [8, 16, 32, 64, 128], while the decoder is a mirror of the encoder. For DFAUST, the sizes were [16, 32, 64, 128] and [128, 64, 32, 32, 16], and dilated convolutions were used for the first and the last two layers of the encoder and the decoder, respectively. We observed that adding an additional convolution at the very end (of size equal to the size of the input feature space) accelerated training. All of our activation functions were ELUs [11]. The learning rate was decayed after each epoch, and weight decay was used for regularization. All models were trained for 300 epochs, which was sufficient for convergence.

4.3 Isotropic vs Anisotropic Convolutions

For the purposes of this experiment, we used the architecture deployed by the authors of [35]. The number of parameters in our case is slightly larger, due to the fact that the number of immediate neighbors, which affects the size of the spiral, ranges from 7 to 10, while the polynomials used in [35] go up to the 6th power of the Laplacian. For the COMA dataset, as clearly illustrated in Fig 4, spiral convolution based autoencoders consistently outperform the spectral ones for every latent dimension, in accordance with the analysis made in Section 3.1. On the DFAUST dataset, the results follow the same trend, with the sole exception of 8 dimensions, where the two models' performance is on par, possibly because the dimensionality of the latent space constrains the expressivity of both models. As the latent dimension increases, our model's performance improves at a higher rate than its counterpart's. Notice that the number of parameters scales in the same way as the latent size grows, but the spiral model makes better use of the added parameters, especially at dimensions 16, 32, 64, and 128. On the COMA dataset in particular, the spectral model seems to flatten between 64 and 128 dimensions, while the spiral model is still noticeably improving.

4.4 Neural 3D Morphable models

4.4.1 Quantitative results

In this section, we compare our Neural3DMM against the PCA, COMA, and Neural3DMM (small) baselines, for different dimensionalities of the latent space. These were chosen based on the variance explained by PCA: for all datasets, the smallest dimension on each plot corresponds to an explained variance of roughly 85% of the total, the middle dimension to roughly 95%, and the highest to 99%.

As can be seen from the graphs in Fig 5, our Neural3DMM achieves smaller generalization errors in every case tested. For the COMA and DFAUST datasets, all hierarchical intrinsic architectures outperform PCA for small latent sizes. This should probably be attributed to the fact that the localized filters allow for effective reconstruction of smaller patches of the shape, such as arms and legs (in the DFAUST case), whereas PCA attempts a more global reconstruction, so its error is distributed equally across the entire shape. This is more thoroughly addressed in the qualitative evaluation subsection. Regarding the comparison between the hierarchical architectures, it is again apparent that our spiral-based autoencoder has increased capacity: the convolutional kernel makes our simple Neural3DMM surpass COMA, and the increased parameter space makes our larger Neural3DMM outperform the other methods by a considerably large margin in terms of both generalization and compression (notice that on COMA, our method with 8 latent dimensions outperforms COMA with 128: 0.54 mm error vs 0.59 mm). Despite the fact that for higher dimensions PCA can explain more than 99% of the total variance, making it a tough-to-beat baseline, our larger model still manages to outperform it. The main advantage here is the substantially smaller number of parameters we make use of. This is clearly seen in the comparison on the MeIn3D dataset, where the large vertex count makes non-local methods such as PCA impractical. It should also be mentioned that larger latent sizes are undesirable for an autoencoder: they contradict our goal of achieving high compression of the 3D shape, and they may lead to less semantically meaningful representations, preventing the encodings from being used for downstream tasks and hindering sampling. The exact values can be found in the tables provided in the supplementary material.

4.4.2 Qualitative results

Figure 7: Analogies in MeIn3D and DFAUST
Figure 8: Interpolations between sufficiently different expressions and identities

In Fig 6 we compare exemplar reconstructions of samples from the test set with a latent size of 16 (96% explained variance by PCA). It is clearly visible that PCA prioritizes body shape over pose, resulting in body parts in the wrong locations (see, for example, the right leg of the woman in the leftmost column). In contrast, COMA places the vertices in approximately correct locations but struggles to recover the fine details of the shape, leading to various artifacts and deformities; our model, on the other hand, balances these two difficult tasks, resulting in quality reconstructions that preserve both pose and shape. Moreover, we assess the representational power of our models via the common practice of testing their ability to perform linear algebra in their latent spaces.

Interpolation. We choose two sufficiently different samples $x_1$ and $x_2$ from our test set, encode them into their latent representations $z_1$ and $z_2$, and then produce intermediate encodings by sampling the line that connects them, $z = a z_1 + (1 - a) z_2$, where $a \in [0, 1]$. We visualize the respective decoded shapes in Fig 8. As can be clearly seen, the generated samples produce a smooth transition between the two endpoint examples.
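Both this experiment and the extrapolation that follows are a few lines each, given the trained encoder/decoder pair (a sketch; `enc` and `dec` denote the Neural3DMM halves):

```python
import numpy as np

def latent_walk(enc, dec, x1, x2, alphas):
    """Decode points z = a*z1 + (1-a)*z2 on the line through the encodings.
    a in [0, 1] interpolates between x2 and x1; a > 1 extrapolates past x1."""
    z1, z2 = enc(x1), enc(x2)
    return [dec(a * z1 + (1 - a) * z2) for a in alphas]

# Interpolation: latent_walk(enc, dec, x1, x2, np.linspace(0.0, 1.0, 7))
# Extrapolation: latent_walk(enc, dec, x1, x_neutral, [1.25, 1.5, 2.0])
```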

Extrapolation. Similarly, we decode latent representations that reside on the line defined by $z_1$ and $z_2$, but outside the respective line segment, $z = a z_1 + (1 - a) z_2$, where $a > 1$. We choose $x_2$ to be the neutral expression for COMA and the neutral pose for DFAUST, in order to showcase the exaggeration of a specific characteristic of the shape (Fig 9).

Shape Analogies. We choose three meshes $A$, $B$, and $C$, and check whether our models can be used to construct a $D$ such that the relation $A : B :: C : D$ is satisfied, using linear algebra in the latent space as in [29]. Denoting by $e(\cdot)$ the mapping from a mesh to its latent encoding, the relation can be expressed as $e(B) - e(A) = e(D) - e(C)$; we solve for $e(D)$ and recover $D$ with the help of the decoder. Some examples can be seen in Fig 7, where we generate novel unseen examples by transferring a specific characteristic between meshes from our dataset.
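The analogy construction as code (same sketch conventions as above):

```python
def analogy(enc, dec, A, B, C):
    """A : B :: C : D  =>  e(D) = e(B) - e(A) + e(C), decoded back to a mesh."""
    return dec(enc(B) - enc(A) + enc(C))
```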

Figure 9: Extrapolation in COMA and DFAUST. Leftmost column: the neutral expression/pose

4.5 GAN evaluation

Figure 10: Generated identities from our intrinsic 3D GAN

In Fig 10 we show several faces sampled from the latent distribution of the trained generator. Notice that they are realistic-looking and, following the statistics of the dataset, span a large proportion of the real distribution of human faces in terms of ethnicity, gender, and age. Compared to the most popular approach for synthesizing faces, the 3DMM, our model learns to produce fine details of the facial structure, making its samples hard to distinguish from real 3D scans, whereas the 3DMM, although it produces smooth surfaces, frequently makes it easy to tell the difference between real and artificially produced samples. We direct the reader to the supplementary material to compare with samples drawn from the 3DMM's latent space.

5 Conclusion

In this paper we introduced a representation learning and generative framework for fixed-topology 3D deformable shapes, making use of a mesh convolutional operator, spiral convolution, as a building block for mesh autoencoder and GAN architectures. We showcased the inherent representational power of the operator by comparing it to a previously used isotropic graph convolutional operator, and showed that our mesh autoencoders achieve state-of-the-art results in mesh reconstruction. Finally, we demonstrated the generation capabilities of our models through vector arithmetic in the autoencoder's latent space, as well as by using our GAN to synthesize realistic, novel facial identities. Regarding future work, we plan to extend our framework to generative models for 3D shapes of arbitrary topology, as well as to other domains that admit an implicit ordering of their primitives, such as point clouds.

References

  • [1] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas. Learning representations and generative models for 3d point clouds. International Conference on Machine Learning (ICML), 2018.
  • [2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.
  • [3] V. Blanz, T. Vetter, et al. A morphable model for the synthesis of 3d faces. In SIGGRAPH, 1999.
  • [4] F. Bogo, J. Romero, G. Pons-Moll, and M. J. Black. Dynamic FAUST: Registering human bodies in motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [5] J. Booth, A. Roussos, A. Ponniah, D. Dunaway, and S. Zafeiriou. Large scale 3d morphable models. International Journal of Computer Vision (IJCV), 2018.
  • [6] J. Booth, A. Roussos, S. Zafeiriou, A. Ponniah, and D. Dunaway. A 3d morphable model learnt from 10,000 faces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [7] D. Boscaini, J. Masci, E. Rodolà, and M. Bronstein. Learning shape correspondence with anisotropic convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2016.
  • [8] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
  • [9] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR), 2014.
  • [10] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. Facewarehouse: A 3d facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics, 2014.
  • [11] D. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). CoRR, 2015.
  • [12] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in Neural Information Processing Systems (NIPS), 2016.
  • [13] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems (NIPS), 2016.
  • [14] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing systems (NIPS), 2015.
  • [15] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [16] M. Fey, J. E. Lenssen, F. Weichert, and H. Müller. Splinecnn: Fast geometric deep learning with continuous b-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [17] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.
  • [18] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and generative vector representation for objects. In European Conference on Computer Vision (ECCV). Springer, 2016.
  • [19] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems (NIPS), 2017.
  • [20] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
  • [21] I. Kostrikov, Z. Jiang, D. Panozzo, D. Zorin, and J. Bruna. Surface networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [22] T. Li, T. Bolkart, M. J. Black, H. Li, and J. Romero. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 2017.
  • [23] I. Lim, A. Dielen, M. Campen, and L. Kobbelt. A simple approach to intrinsic correspondence learning on unstructured 3d meshes. Proceedings of the European Conference on Computer Vision Workshops (ECCVW), 2018.
  • [24] O. Litany, A. Bronstein, M. Bronstein, and A. Makadia. Deformable shape completion with graph convolutional autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1886–1895, 2018.
  • [25] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, Oct. 2015.
  • [26] P. Mandikal, N. Murthy, M. Agarwal, and R. V. Babu. 3d-lmnet: Latent embedding matching for accurate and diverse 3d point cloud reconstruction from a single image. British Machine Vision Conference (BMVC), 2018.
  • [27] J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst. Geodesic convolutional neural networks on riemannian manifolds. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), pages 37–45, 2015.
  • [28] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015.
  • [29] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS). 2013.
  • [30] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [31] S. Ploumpis, H. Wang, N. Pears, W. A. Smith, and S. Zafeiriou. Combining 3d morphable models: A large scale face-and-head model. arXiv preprint arXiv:1903.03785, 2019.
  • [32] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [33] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view cnns for object classification on 3d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [34] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems (NIPS), pages 5099–5108, 2017.
  • [35] A. Ranjan, T. Bolkart, S. Sanyal, and M. J. Black. Generating 3d faces using convolutional mesh autoencoders. Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • [36] J. Romero, D. Tzionas, and M. J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 2017.
  • [37] A. Sharma, O. Grau, and M. Fritz. Vconv-dae: Deep volumetric shape learning without object labels. In European Conference on Computer Vision, pages 236–250. Springer, 2016.
  • [38] L. Tran and X. Liu. Nonlinear 3d face morphable model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [39] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018.
  • [40] N. Verma, E. Boyer, and J. Verbeek. Feastnet: Feature-steered graph convolutions for 3d shape analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2598–2606, 2018.
  • [41] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems (NIPS), 2016.
  • [42] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [43] Y. Yang, C. Feng, Y. Shen, and D. Tian. Foldingnet: Point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [44] L. Yi, H. Su, X. Guo, and L. J. Guibas. Syncspeccnn: Synchronized spectral cnn for 3d shape segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [45] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems (NIPS), 2018.
  • [46] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. International Conference on Learning Representations (ICLR), 2016.